tripal / tripal_analysis_expression

Extension module for the Tripal toolset to show differential expression data. This module was made for Drupal 7, Tripal 3, and Chado 1.3.
GNU General Public License v2.0
4 stars 11 forks source link

Updated BioSample Importer #392

Closed spficklin closed 2 years ago

spficklin commented 2 years ago

This PR addresses issues with the BioSample loader:

Fix for #385

This was fixed by rewriting the loader to mimic the Tripal GFF3 loader such that batches of records are inserted together rather than at once.

Fix for #381, #383 and #384

The importer now uses a multi-step form. The first step, the user provides the input file, sets the organism and the analysis. On the second step, the user is presented with all of the attributes and they can set the controlled vocabulary for each one. To help ease the burden of setting these terms, a new NCBI_BioSample_Terms cv/db is added and is populated with terms from this XML file: http://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml. This happens during install of the module and updated on existing installations via the update 7302.

Testing

  1. After pull this code, be sure to run a drush updatedb to get the database updates.
  2. Test loading your favorite biosample file (XML or CSV). I have tested the loader on a 475-sample Rice BioSample XML file downloaded from NCBI and a huge tab-delimited file of 10K samples that I created. All went in just fine, in a speedy fashion!
spficklin commented 2 years ago

@dsenalik I think you are using this module on your site. Would you mind testing this PR for me? I've got permission from the Staton Lab to merge in pull requests but I figure I should get someone else to review if possible.

dsenalik commented 2 years ago

I will get to this eventually, I am working on creating a dataset to load

spficklin commented 2 years ago

Great! Thanks @dsenalik . I have a new loader for expression data in PR #397 if you could do a test of that as well I'd appreciate it.

spficklin commented 2 years ago

@dsenalik any interest in reviewing this and the other 2 PRs? I need to merge them in to avoid some complicated merge conflicts. If you don't have time then I can merge and hope for the best and we can fix issue as they come up.

dsenalik commented 2 years ago

I have not done this yet because I wanted to find "real" data to test with. That was harder than I thought, I'll go ahead with some fake data on my test system.

spficklin commented 2 years ago

That would be great. I have it working with a 475 NCBI BioSample XML sample file list but I'm worried I may have broken it for other inputs so if it can work with what you want to load for your site I think would be a great test.

dsenalik commented 2 years ago
  1. pull #392 Line 240 of tripal_biomaterial/tripal_biomaterial.install is missing a comma at the end.
  2. Style issue, starting at line 258, indented too much (was that already like that?)
  3. Starting at line 312 need to indent 2 more spaces, and there is a missing closing brace for the foreach
  4. I will mention this in case I have done something horribly wrong, but I was only seeing update 7303, and 7301 needs to run to add cvalue etc. Probably something from my attempts to update this module earlier, so I went and reset to 7300 in the public.system table for this module.
  5. Update 7303 lots of errors of the form

    
    ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
    (
    [cv_id] => 58
    [name] => SimpleXMLElement Object
        (
            [0] => treatment
        )
    
    [definition] => 
    [dbxref_id] => 443412
    [is_obsolete] => 0
    [is_relationshiptype] => 0
    )

[site http://default] [TRIPAL ERROR] [TRIPAL_CHADO] chado_insert_record; Cannot insert duplicate record into cvterm table: Array( [cv_id] => 58 [name] => SimpleXMLElement Object ( [0] => treatment ) [definition] => [dbxref_id] => 443412 [is_obsolete] => 0 [is_relationshiptype] => 0) sizeof(): Parameter must be an array or an object that implements Countable [warning] tripal.notice.api.inc:134 [site http://default] [TRIPAL WARNING] [TRIPAL_CV] Failed to insert the term: treatment (local)


Should line 407 be ```!$cvterm->find()``` ?
spficklin commented 2 years ago

Thanks for the review @dsenalik. Some comments in reply:

  1. Oops, that missing comma was a mistake I made after testing so I didn't notice it. It's now fixed.
  2. That indentation starting on line 258 is fixed. I failed to notice it was off...
  3. The problem with the missing brace was something I must have introduced in a merge conflict resolution. It's fixed now.
  4. I've reset the version number in the past as well and as long as the update hooks are programmed well it shouldn't cause a problem. The other thing I've done in the past is to use a drush command to test the update hook: drush php-eval 'module_load_include("install", "tripal_biomaterial", "tripal_biomaterial"); tripal_biomaterial_update_7303();'
  5. Well, that was a goof on my part some confusing code with updating terms from NCBI. I've fixed it and you should be able to get past the updates now.
  6. Oh, and line 407 is okay as it's looking to apply an update not an insert.

Sorry the install was so problematic. :0/

dsenalik commented 2 years ago

I have a situation. The old version of this module had code such as

  tripal_insert_cvterm([
    'name' => 'organism',
    'def' => 'The most descriptive organism name for this sample (to the species, if relevant).',
    'cv_name' => 'biomaterial_property',
    'db_name' => 'tripal',
  ]);

So I currently have the 'organism' cvterm in the 'biomaterial_property' cv linked with a dbxref to the 'tripal' db.

When I try to run update 7303 it loads some cvterms with code such as

chado_insert_cvterm([
    'id' => 'local:organism',
    'name' => 'organism',
    'definition' => 'The most descriptive organism name for this sample (to the species, if relevant).',
    'cv_name' => 'biomaterial_property',
    'db_name' => 'local',
  ]);

which produces the following error

ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
(
    [cv_id] => 58    (this is the biomaterial_property cv)
    [name] => organism
    [definition] => The most descriptive organism name for this sample (to the species, if relevant).
    [dbxref_id] => 442945
    [is_obsolete] => 0
    [is_relationshiptype] => 0
)

inside tripal_chado.cv.api.inc in the function chado_insert_cvterm() the following query is used to find the cvterm, but it returns zero results, and that triggers the error above

SELECT CVT.name, CVT.cvterm_id, CV.cv_id, CV.name as cvname,
      DB.name as dbname, DB.db_id, DBX.accession
    FROM cvterm CVT
      INNER JOIN dbxref DBX on CVT.dbxref_id = DBX.dbxref_id
      INNER JOIN db DB on DBX.db_id = DB.db_id
      INNER JOIN cv CV on CV.cv_id = CVT.cv_id
    WHERE DBX.accession = 'organism' and DB.name = 'local';

I am currently trying to trace what happens. A new dbxref for 'organism' to 'local' db is created. But the existing dbxref uses the 'tripal' db.

I have 459 cvterms in the biomaterial_property that use the 'tripal' db instead of 'local'. Should all of these be corrected?

dsenalik commented 2 years ago

I tested changing dbxrefs from tripal db to local db for these (id values hardcoded!) with

$n = 0;
$sql1 = "SELECT X.dbxref_id, C.cvterm_id, C.name, X.accession FROM {cvterm} C LEFT JOIN {dbxref} X on C.dbxref_id=X.dbxref_id WHERE C.cv_id=58 AND X.db_id=44";
$results1 = chado_query($sql1, []);
foreach ($results1 as $result1) {
  $sql2 = "UPDATE {dbxref} SET db_id = 32 WHERE dbxref_id=:dbxref_id";
  $args2 = [ ':dbxref_id' => $result1->dbxref_id];
  print "dbxref_id=$result1->dbxref_id cvterm_id=$result1->cvterm_id accession=$result1->accession name=$result1->name\n";
  chado_query($sql2, $args2);
  $n++;
}
print "Changed $n records\n";

And now the module updates work without error. Testing may resume...

Well, this affects a number of chado property tripal fields on the biomaterial pages, e.g.

WARNING: The controlled vocabulary term, 'tripal:organism (organism)', assigned to the field, 'tripal__organism', is not in the database. The field cannot be shown. Please add the term and the field will appear below.

I have six of these: organism, description, age, sample_name, biomaterial_provider, tissue. This can be resolved by removing the old versions of the fields with tripal and then "Check for new fields" to get the new ones for local.

dsenalik commented 2 years ago

On the form part 2, there are checkboxes for the cv term, for example

20220317_tissue-1

I think these should be radio buttons, because if more than one is checked, only one of them is used

20220317_tissue-2

Apparently this is from function tripal_get_term_lookup_form() and radio button is not an option? Maybe the code here should store each, then? I dunno...

So far, loading of xml files with one or multiple biosamples is successful!

spficklin commented 2 years ago

Great! I'm glad the loading worked. Yeah, the problem with being able to click multiple elements is with Tripal so we'd need to fix it there. I'll think about what to do in this case.

dsenalik commented 2 years ago

If you think this is a good candidate for a new parameter to tripal_get_term_lookup_form I could try to do that, then you wouldn't have to change anything here except that parameter.

spficklin commented 2 years ago

Yeah. Perhaps we leave it like it is here and we can fix the problem on the Tripal core side of things.

Related to the issues with the dbxrefs.... If you had the problem with the terms being present in the tripal vocabulary and local then others who are currently using the module will have the same problem and won't be able to upgrade. So, that needs to be fixed in this module. I can implement a fix, but I have no way of testing it. If I can put together a fix can you retest it or is your database in state that you can't go back to what it was?

dsenalik commented 2 years ago

I am only testing on my test site, I can test a fix.

spficklin commented 2 years ago

Okay, great! I'll let you know when a fix is ready. Thanks for all your help.

dsenalik commented 2 years ago

The change for a radio button is pretty minimal, you could even incorporate this now as the extra parameter won't do anything without a tripal update

      // Add the lookup fieldset to the form, radio set to return only one term.
      tripal_get_term_lookup_form($form, $form_state, $default_term_name,
          $title, '', FALSE,  '', $delta,
          'tripal_biomaterial_loader_form_ajax_callback', 
          '', [], 0, TRUE);

example with this activated radio1

spficklin commented 2 years ago

@dsenalik after looking over the update hook 7303 I see that the code was meant to prevent the errors you were encountering with the dbxref's. So, I had anticipated the problem (I forgot I did that) but the problem is that it was a bit out of order. I moved things around a bit and I think now it might work if you want to try it.

I like the radio button change PR you submitted to the Tripal core. I left a comment for you there about it. I think we're close to merging this PR as long as your update goes fine with the most recent fix I just pushed.

dsenalik commented 2 years ago

Database update ran perfectly without any special handling or running earlier updates

drush updatedb
 Tripal_biomaterial  7303  Adds the NCBI biomaterial database and adds the terms to it.  Also moves  
                           terms from the tripal vocabulary into local.
Do you wish to run all pending updates? (y/n): y
Performed update: tripal_biomaterial_update_7303                                             [ok]
'all' cache was cleared.                                                                     [success]
Finished performing updates.                                                                 [ok]

I ran loading tests again which were successful.

Compatibility

This module creates a DB "NCBI_BioSample_Terms", but the tripal_eutils module creates a DB for the same purpose nnamed "ncbi_properties". Likewise, this module creates a CV "NCBI Biosample Attributes" and tripal_eutils uses "ncbi_properties". Other DBs are the same between modules e.g. "NCBI BioSample", "NCBI SRA".

For compatibility, can this module use "ncbi_properties"?

I tested this change and it seems to work the same. What do you think about this? for f in includes/TripalImporter/tripal_biomaterial_loader_v3.inc api/tripal_biomaterial.api.inc tripal_biomaterial.install ; do sed -i 's/NCBI_BioSample_Terms/ncbi_properties/' $f ; sed -i 's/NCBI Biosample Attributes/ncbi_properties/' $f ; done

Solved

I am seeing at step 2: Notice: Undefined property: stdClass::$definition in tripal_get_term_lookup_form() (line 712 of .../sites/all/modules/tripal/tripal/api/tripal.terms.api.inc).

Well, this is because the definition for the cv term "ncbi_properties" as defined by tripal_eutils is 333 characters long, and is not automatically expanded by chado_generate_var() in tripal/tripal/api/tripal.terms.api.inc There is code there to expand it, but it doesn't work right. This is a bug in tripal, is fixed in https://github.com/tripal/tripal/pull/1259

dsenalik commented 2 years ago

Line 1877 of tripal_biomaterial_loader_v3.inc should be local instead of tripal?

        if (!tripal_insert_cvterm([
          'name' => (string) $attr_name,
          'definition' => '',
          'cv_name' => 'biomaterial_property',
          'is_relationship' => 0,
          'db_name' => 'tripal',
        ], ['update_existing' => FALSE])) {

Also line 2110

dsenalik commented 2 years ago

Why do we have both NCBI Biosample Attributes (or ncbi_properties) and biomaterial_property cvs? Should these be merged?

dsenalik commented 2 years ago

I have done a test of abandoning the biomaterial_property cv and using only ncbi_properties, and loading seems to continue to work successfully. I am attaching a diff of the changes 20220322_changes.txt I commented out part of the fix for tripal->local db to see if it is even needed now. We would probably want to keep it in for data previously loaded, though.

Please comment if I am doing crazy things here! .

spficklin commented 2 years ago

Why do we have both NCBI Biosample Attributes (or ncbi_properties) and biomaterial_property cvs? Should these be merged?

Oh, that's a good question.

A bit of background about CV and DB records in Chado (apologies if you already know this). When storing vocabularies the db table of Chado is used for storing the "idspace" name of the vocabulary. For the Gene Ontology this is GO for the Sequence Ontology this is SO, etc. The cv table is used for linking terms to their namespace. For example. Within GO you have molecular function as a namespace, so that's the name used in the cv table. To form the term for the vocabulary it's always the idspace name (from the db table) combined with the term accession from the dbxref table. The namespace for the vocabulary is just for organizing the terms.

It's a bit historical why we have a biomaterial_property cv and I may be remembering incorrectly but here goes.... Chado originally came with some vocabularies for tables such as feature_property, They were meant as a dumping ground for terms that were locally defined and not part of any vocabulary. In order to preserve backwards compatibility we kept those supported by Tripal and for consistency may have added some as well. I suspect the biomaterial_property vocabulary used by this module was following that style.

Notice that all of the local terms created by this module are in the local db. By using the biomaterial_property namespace in the local idspace it makes it easier to find terms that are specific to biomaterials. Without it, all local terms are in the local namespace. So having the biomaterial_property namespace provides some separation of terms. But in the end the term is always going to be something like local:term_name.

The cv NCBI BioSample Attributes contains terms from the NCBI_BioSample_Terms db. These are terms that are from NCBI's BioSample Attribute vocabulary (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/). These are not local terms so they should not be in a biomaterial_property cv. During install, the module downloads the XML for all of these terms and imports them. Unfortunately, this vocabulary provided by NCBI doesn't have an official namespace or idspace. So, I made the idspace NCBI_BioSample_Terms and the namespace NCBI BioSample Attributes so that at least it was descriptive for the vocabulary.

Prior to this PR those attribute terms from NCBI were dumped into the biomaterial_property CV but I felt that was incorrect because these terms come from NCBI. I felt it was important not to put these terms in a local vocabulary because 1) they are official NCBI terms and not created by the local site, and 2) by using NCBI terms officially we have a real vocabulary to use for data exchange. Anyone who uses the NCBI attribute 'tissue' is guaranteed to be using the same term, where as a local term implies it's unique to the site.

Regarding the use of ncbi_properties in the EUtils module. It's used both as a CV namespace and an idspace. I think that's a bit unfortunate a name because it's too general. There are going to be properties for NCBI samples, NCBI projects, NCBI Sequence Runs/Experiments, etc., and I'm guessing that vocabulary name is being used for all of them? I don't know as I've not used the EUtils module before. I don't know how that module is populating those terms. Looking briefly at the module it seems to only create a few terms such as ncbi_properties:submitter_provided_accession. But I think terms like that should be in the local idspace (e.g. local:submitter_provided_accession ) because it's a locally defined term.

I've not used the EUtils module before but if there are indeed official NCBI terms populated into that namespace/idspace then we can look deeper but from my quick look I think there won't be a conflict and my preference is to use a namespace/idspace that reflects the terms added by this module are part of an official NCBI vocabulary. Although, I'm open to changing the naming of either. I'm not 100% happy with them and NCBI doesn't provide an official namepsace/idspace for them.

dsenalik commented 2 years ago

Thank you for this explanation. I suspected there was a good reason that this is how it was set up.

So, other than the two 'db_name' => 'tripal', corrections I noted earlier, then I think it all looks good to me.

And for reference, eutils loads the same terms from NCBI, they have a static copy of the xml included as part of the module. The ncbi BioSample properties CV is downloaded from https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml

It's interesting, that module inserts several DBs: 'NCBI BioSample", 'NCBI SRA', 'ncbi_properties', NCBI WGS', 'NCBI GenBank', 'NCBI Assembly', 'Refseq Assembly', 'Genbank Assembly' that are used for xrefs, but adds only the one CV. It looks like only four new terms are defined for that CV not in the downloaded xml.

spficklin commented 2 years ago

One thing that makes the db table a bit confusing is that it is used for more than just vocabularies. For example the NCBI SRA accession IDs are not vocabulary terms but refer to a record in the NCBI SRA (e.g. SRA Run, Experiment or Sample). So, those other entries in the db table that are in the EUtils module, I'm guessing, are for linking these accession numbers. I think the ncbi_properties entry in the db table was the one meant for the vocabulary terms.

dsenalik commented 2 years ago

That is exactly correct, it links accession numbers that way.

spficklin commented 2 years ago

And for reference, eutils loads the same terms from NCBI, they have a static copy of the xml included as part of the module.

I see, so the EUtils module is loading the same terms as this module, but into the ncbi_properties namespace. Are those the only terms loaded into that namespace?

That is a problem because we will have two copies of the same terms in two different namespaces... we probably don't want to create a conflict with end-users who might be using both of these modules....

So we could change this module to use the ncbi_properties idspace (I don't like ncbi_properties as the namespace... it's too generic). Maybe we could ask @Ferrisx4 what he thinks and see if we can come up with a consensus for both modules. I see he's the most recent person to work on that module.

dsenalik commented 2 years ago

Yes, they will be nearly identical, eutils has the static xml file with the date 11-30-18 so there may be terms added since then, plus the four "module-defined" terms I mentioned are actually 5: 'samples provided by', 'Publication', 'note', 'full_ncbi_xml', 'ncbi_FTP_links', 'submitter_provided_accession'

Ferrisx4 commented 2 years ago

As the plan of action was discussed on a different platform, I will post here to bridge the gap.

Summary of changes

  1. Both modules will be modified to use the following
    • namespace: NCBI BioSample Attributes
    • idpsace: NCBI_BioSampleAttributesNCBI_BioSample_Attributes
    • Each module would also have code to update any existing sites to use these.
  2. The EUtils module will be altered to load the NCBI file for sample attributes in a similar manner as this module does currently.

This PR does not have to wait on the forthcoming fix to the EUtils module.


Other notes

Having the EUtils module be a dependency of this module was also considered, but ultimately decided against for the purpose of simplicity.

dsenalik commented 2 years ago

Actually idspace: NCBI_BIoSample_Attributes was the last version proposed.

Ferrisx4 commented 2 years ago

Thanks @dsenalik, I forgot the underscore when writing the comment. Our conversation and the resulting code does include the underscore: NCBI_BIoSample_Attributes

spficklin commented 2 years ago

It's been awhile since I was able to get back to this but it sounds like things are in a good state so I'm going to merge. If there are any residual problems we can fix with new PRs.