Closed spficklin closed 2 years ago
@dsenalik I think you are using this module on your site. Would you mind testing this PR for me? I've got permission from the Staton Lab to merge in pull requests but I figure I should get someone else to review if possible.
I will get to this eventually, I am working on creating a dataset to load
Great! Thanks @dsenalik . I have a new loader for expression data in PR #397 if you could do a test of that as well I'd appreciate it.
@dsenalik any interest in reviewing this and the other 2 PRs? I need to merge them in to avoid some complicated merge conflicts. If you don't have time then I can merge and hope for the best and we can fix issue as they come up.
I have not done this yet because I wanted to find "real" data to test with. That was harder than I thought, I'll go ahead with some fake data on my test system.
That would be great. I have it working with a 475 NCBI BioSample XML sample file list but I'm worried I may have broken it for other inputs so if it can work with what you want to load for your site I think would be a great test.
foreach
Update 7303 lots of errors of the form
ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
(
[cv_id] => 58
[name] => SimpleXMLElement Object
(
[0] => treatment
)
[definition] =>
[dbxref_id] => 443412
[is_obsolete] => 0
[is_relationshiptype] => 0
)
[site http://default] [TRIPAL ERROR] [TRIPAL_CHADO] chado_insert_record; Cannot insert duplicate record into cvterm table: Array( [cv_id] => 58 [name] => SimpleXMLElement Object ( [0] => treatment ) [definition] => [dbxref_id] => 443412 [is_obsolete] => 0 [is_relationshiptype] => 0) sizeof(): Parameter must be an array or an object that implements Countable [warning] tripal.notice.api.inc:134 [site http://default] [TRIPAL WARNING] [TRIPAL_CV] Failed to insert the term: treatment (local)
Should line 407 be ```!$cvterm->find()``` ?
Thanks for the review @dsenalik. Some comments in reply:
drush php-eval 'module_load_include("install", "tripal_biomaterial", "tripal_biomaterial"); tripal_biomaterial_update_7303();'
update
not an insert
.Sorry the install was so problematic. :0/
I have a situation. The old version of this module had code such as
tripal_insert_cvterm([
'name' => 'organism',
'def' => 'The most descriptive organism name for this sample (to the species, if relevant).',
'cv_name' => 'biomaterial_property',
'db_name' => 'tripal',
]);
So I currently have the 'organism' cvterm in the 'biomaterial_property' cv linked with a dbxref to the 'tripal' db.
When I try to run update 7303 it loads some cvterms with code such as
chado_insert_cvterm([
'id' => 'local:organism',
'name' => 'organism',
'definition' => 'The most descriptive organism name for this sample (to the species, if relevant).',
'cv_name' => 'biomaterial_property',
'db_name' => 'local',
]);
which produces the following error
ERROR (TRIPAL_CHADO): chado_insert_record; Cannot insert duplicate record into cvterm table: Array
(
[cv_id] => 58 (this is the biomaterial_property cv)
[name] => organism
[definition] => The most descriptive organism name for this sample (to the species, if relevant).
[dbxref_id] => 442945
[is_obsolete] => 0
[is_relationshiptype] => 0
)
inside tripal_chado.cv.api.inc in the function chado_insert_cvterm() the following query is used to find the cvterm, but it returns zero results, and that triggers the error above
SELECT CVT.name, CVT.cvterm_id, CV.cv_id, CV.name as cvname,
DB.name as dbname, DB.db_id, DBX.accession
FROM cvterm CVT
INNER JOIN dbxref DBX on CVT.dbxref_id = DBX.dbxref_id
INNER JOIN db DB on DBX.db_id = DB.db_id
INNER JOIN cv CV on CV.cv_id = CVT.cv_id
WHERE DBX.accession = 'organism' and DB.name = 'local';
I am currently trying to trace what happens. A new dbxref for 'organism' to 'local' db is created. But the existing dbxref uses the 'tripal' db.
I have 459 cvterms in the biomaterial_property that use the 'tripal' db instead of 'local'. Should all of these be corrected?
I tested changing dbxrefs from tripal db to local db for these (id values hardcoded!) with
$n = 0;
$sql1 = "SELECT X.dbxref_id, C.cvterm_id, C.name, X.accession FROM {cvterm} C LEFT JOIN {dbxref} X on C.dbxref_id=X.dbxref_id WHERE C.cv_id=58 AND X.db_id=44";
$results1 = chado_query($sql1, []);
foreach ($results1 as $result1) {
$sql2 = "UPDATE {dbxref} SET db_id = 32 WHERE dbxref_id=:dbxref_id";
$args2 = [ ':dbxref_id' => $result1->dbxref_id];
print "dbxref_id=$result1->dbxref_id cvterm_id=$result1->cvterm_id accession=$result1->accession name=$result1->name\n";
chado_query($sql2, $args2);
$n++;
}
print "Changed $n records\n";
And now the module updates work without error. Testing may resume...
Well, this affects a number of chado property tripal fields on the biomaterial pages, e.g.
WARNING: The controlled vocabulary term, 'tripal:organism (organism)', assigned to the field, 'tripal__organism', is not in the database. The field cannot be shown. Please add the term and the field will appear below.
I have six of these: organism, description, age, sample_name, biomaterial_provider, tissue. This can be resolved by removing the old versions of the fields with tripal and then "Check for new fields" to get the new ones for local.
On the form part 2, there are checkboxes for the cv term, for example
I think these should be radio buttons, because if more than one is checked, only one of them is used
Apparently this is from function tripal_get_term_lookup_form()
and radio button is not an option?
Maybe the code here should store each, then? I dunno...
So far, loading of xml files with one or multiple biosamples is successful!
Great! I'm glad the loading worked. Yeah, the problem with being able to click multiple elements is with Tripal so we'd need to fix it there. I'll think about what to do in this case.
If you think this is a good candidate for a new parameter to tripal_get_term_lookup_form
I could try to do that, then you wouldn't have to change anything here except that parameter.
Yeah. Perhaps we leave it like it is here and we can fix the problem on the Tripal core side of things.
Related to the issues with the dbxrefs.... If you had the problem with the terms being present in the tripal
vocabulary and local
then others who are currently using the module will have the same problem and won't be able to upgrade. So, that needs to be fixed in this module. I can implement a fix, but I have no way of testing it. If I can put together a fix can you retest it or is your database in state that you can't go back to what it was?
I am only testing on my test site, I can test a fix.
Okay, great! I'll let you know when a fix is ready. Thanks for all your help.
The change for a radio button is pretty minimal, you could even incorporate this now as the extra parameter won't do anything without a tripal update
// Add the lookup fieldset to the form, radio set to return only one term.
tripal_get_term_lookup_form($form, $form_state, $default_term_name,
$title, '', FALSE, '', $delta,
'tripal_biomaterial_loader_form_ajax_callback',
'', [], 0, TRUE);
example with this activated
@dsenalik after looking over the update hook 7303 I see that the code was meant to prevent the errors you were encountering with the dbxref's. So, I had anticipated the problem (I forgot I did that) but the problem is that it was a bit out of order. I moved things around a bit and I think now it might work if you want to try it.
I like the radio button change PR you submitted to the Tripal core. I left a comment for you there about it. I think we're close to merging this PR as long as your update goes fine with the most recent fix I just pushed.
Database update ran perfectly without any special handling or running earlier updates
drush updatedb
Tripal_biomaterial 7303 Adds the NCBI biomaterial database and adds the terms to it. Also moves
terms from the tripal vocabulary into local.
Do you wish to run all pending updates? (y/n): y
Performed update: tripal_biomaterial_update_7303 [ok]
'all' cache was cleared. [success]
Finished performing updates. [ok]
I ran loading tests again which were successful.
Compatibility
This module creates a DB "NCBI_BioSample_Terms", but the tripal_eutils module creates a DB for the same purpose nnamed "ncbi_properties". Likewise, this module creates a CV "NCBI Biosample Attributes" and tripal_eutils uses "ncbi_properties". Other DBs are the same between modules e.g. "NCBI BioSample", "NCBI SRA".
For compatibility, can this module use "ncbi_properties"?
I tested this change and it seems to work the same. What do you think about this?
for f in includes/TripalImporter/tripal_biomaterial_loader_v3.inc api/tripal_biomaterial.api.inc tripal_biomaterial.install ; do sed -i 's/NCBI_BioSample_Terms/ncbi_properties/' $f ; sed -i 's/NCBI Biosample Attributes/ncbi_properties/' $f ; done
Solved
I am seeing at step 2:
Notice: Undefined property: stdClass::$definition in tripal_get_term_lookup_form() (line 712 of .../sites/all/modules/tripal/tripal/api/tripal.terms.api.inc).
Well, this is because the definition for the cv term "ncbi_properties" as defined by tripal_eutils
is 333 characters long, and is not automatically expanded by chado_generate_var()
in tripal/tripal/api/tripal.terms.api.inc
There is code there to expand it, but it doesn't work right. This is a bug in tripal, is fixed in https://github.com/tripal/tripal/pull/1259
Line 1877 of tripal_biomaterial_loader_v3.inc
should be local
instead of tripal
?
if (!tripal_insert_cvterm([
'name' => (string) $attr_name,
'definition' => '',
'cv_name' => 'biomaterial_property',
'is_relationship' => 0,
'db_name' => 'tripal',
], ['update_existing' => FALSE])) {
Also line 2110
Why do we have both NCBI Biosample Attributes
(or ncbi_properties
) and biomaterial_property
cvs? Should these be merged?
I have done a test of abandoning the biomaterial_property
cv and using only ncbi_properties
, and loading seems to continue to work successfully. I am attaching a diff of the changes
20220322_changes.txt
I commented out part of the fix for tripal->local db to see if it is even needed now. We would probably want to keep it in for data previously loaded, though.
Please comment if I am doing crazy things here! .
Why do we have both NCBI Biosample Attributes (or ncbi_properties) and biomaterial_property cvs? Should these be merged?
Oh, that's a good question.
A bit of background about CV and DB records in Chado (apologies if you already know this). When storing vocabularies the db
table of Chado is used for storing the "idspace" name of the vocabulary. For the Gene Ontology this is GO
for the Sequence Ontology this is SO
, etc. The cv
table is used for linking terms to their namespace. For example. Within GO you have molecular function
as a namespace, so that's the name used in the cv
table. To form the term for the vocabulary it's always the idspace name (from the db
table) combined with the term accession from the dbxref
table. The namespace for the vocabulary is just for organizing the terms.
It's a bit historical why we have a biomaterial_property
cv and I may be remembering incorrectly but here goes.... Chado originally came with some vocabularies for tables such as feature_property
, They were meant as a dumping ground for terms that were locally defined and not part of any vocabulary. In order to preserve backwards compatibility we kept those supported by Tripal and for consistency may have added some as well. I suspect the biomaterial_property
vocabulary used by this module was following that style.
Notice that all of the local terms created by this module are in the local
db. By using the biomaterial_property
namespace in the local
idspace it makes it easier to find terms that are specific to biomaterials. Without it, all local terms are in the local
namespace. So having the biomaterial_property
namespace provides some separation of terms. But in the end the term is always going to be something like local:term_name
.
The cv NCBI BioSample Attributes
contains terms from the NCBI_BioSample_Terms
db. These are terms that are from NCBI's BioSample Attribute vocabulary (https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/). These are not local terms so they should not be in a biomaterial_property
cv. During install, the module downloads the XML for all of these terms and imports them. Unfortunately, this vocabulary provided by NCBI doesn't have an official namespace or idspace. So, I made the idspace NCBI_BioSample_Terms
and the namespace NCBI BioSample Attributes
so that at least it was descriptive for the vocabulary.
Prior to this PR those attribute terms from NCBI were dumped into the biomaterial_property
CV but I felt that was incorrect because these terms come from NCBI. I felt it was important not to put these terms in a local vocabulary because 1) they are official NCBI terms and not created by the local site, and 2) by using NCBI terms officially we have a real vocabulary to use for data exchange. Anyone who uses the NCBI attribute 'tissue' is guaranteed to be using the same term, where as a local
term implies it's unique to the site.
Regarding the use of ncbi_properties
in the EUtils module. It's used both as a CV namespace and an idspace. I think that's a bit unfortunate a name because it's too general. There are going to be properties for NCBI samples, NCBI projects, NCBI Sequence Runs/Experiments, etc., and I'm guessing that vocabulary name is being used for all of them? I don't know as I've not used the EUtils module before. I don't know how that module is populating those terms. Looking briefly at the module it seems to only create a few terms such as ncbi_properties:submitter_provided_accession
. But I think terms like that should be in the local idspace (e.g. local:submitter_provided_accession
) because it's a locally defined term.
I've not used the EUtils module before but if there are indeed official NCBI terms populated into that namespace/idspace then we can look deeper but from my quick look I think there won't be a conflict and my preference is to use a namespace/idspace that reflects the terms added by this module are part of an official NCBI vocabulary. Although, I'm open to changing the naming of either. I'm not 100% happy with them and NCBI doesn't provide an official namepsace/idspace for them.
Thank you for this explanation. I suspected there was a good reason that this is how it was set up.
So, other than the two 'db_name' => 'tripal',
corrections I noted earlier, then I think it all looks good to me.
And for reference, eutils loads the same terms from NCBI, they have a static copy of the xml included as part of the module.
The ncbi BioSample properties CV is downloaded from https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml
It's interesting, that module inserts several DBs: 'NCBI BioSample", 'NCBI SRA', 'ncbi_properties', NCBI WGS', 'NCBI GenBank', 'NCBI Assembly', 'Refseq Assembly', 'Genbank Assembly' that are used for xrefs, but adds only the one CV. It looks like only four new terms are defined for that CV not in the downloaded xml.
One thing that makes the db
table a bit confusing is that it is used for more than just vocabularies. For example the NCBI SRA accession IDs are not vocabulary terms but refer to a record in the NCBI SRA (e.g. SRA Run, Experiment or Sample). So, those other entries in the db
table that are in the EUtils module, I'm guessing, are for linking these accession numbers. I think the ncbi_properties
entry in the db
table was the one meant for the vocabulary terms.
That is exactly correct, it links accession numbers that way.
And for reference, eutils loads the same terms from NCBI, they have a static copy of the xml included as part of the module.
I see, so the EUtils module is loading the same terms as this module, but into the ncbi_properties
namespace. Are those the only terms loaded into that namespace?
That is a problem because we will have two copies of the same terms in two different namespaces... we probably don't want to create a conflict with end-users who might be using both of these modules....
So we could change this module to use the ncbi_properties
idspace (I don't like ncbi_properties
as the namespace... it's too generic). Maybe we could ask @Ferrisx4 what he thinks and see if we can come up with a consensus for both modules. I see he's the most recent person to work on that module.
Yes, they will be nearly identical, eutils has the static xml file with the date 11-30-18 so there may be terms added since then, plus the four "module-defined" terms I mentioned are actually 5: 'samples provided by', 'Publication', 'note', 'full_ncbi_xml', 'ncbi_FTP_links', 'submitter_provided_accession'
As the plan of action was discussed on a different platform, I will post here to bridge the gap.
NCBI BioSample Attributes
NCBI_BioSampleAttributes
NCBI_BioSample_Attributes
This PR does not have to wait on the forthcoming fix to the EUtils module.
Having the EUtils module be a dependency of this module was also considered, but ultimately decided against for the purpose of simplicity.
Actually idspace: NCBI_BIoSample_Attributes
was the last version proposed.
Thanks @dsenalik, I forgot the underscore when writing the comment. Our conversation and the resulting code does include the underscore: NCBI_BIoSample_Attributes
It's been awhile since I was able to get back to this but it sounds like things are in a good state so I'm going to merge. If there are any residual problems we can fix with new PRs.
This PR addresses issues with the BioSample loader:
381: When loading CVT/TSV files cvterm selection doesn't appear.
383: Headers in input files must exactly match CV terms.
384: When too many columns (attributes) are present in the input file then the browser crashes.
385: Loader is extremely slow (I need to load 10K samples)
Fix for #385
This was fixed by rewriting the loader to mimic the Tripal GFF3 loader such that batches of records are inserted together rather than at once.
Fix for #381, #383 and #384
The importer now uses a multi-step form. The first step, the user provides the input file, sets the organism and the analysis. On the second step, the user is presented with all of the attributes and they can set the controlled vocabulary for each one. To help ease the burden of setting these terms, a new NCBI_BioSample_Terms cv/db is added and is populated with terms from this XML file: http://www.ncbi.nlm.nih.gov/biosample/docs/attributes/?format=xml. This happens during install of the module and updated on existing installations via the update 7302.
Testing
drush updatedb
to get the database updates.