Weird undesrcore blocks part retrieval from directory during import_parts.py ?

nickdelkis commented 2 years ago

Ain't sure if the title is what it ought, please suggest a better one when you understand!

As I am looking into why some parts have a "missing sequence" for #153 , I realized that the pipeline uses the Data Source ID column from the excel sheet to look for it in the directory, but is unable to recognize several sequences, even though the file prefix and the Data Source ID match.

The related piece of code

https://github.com/iGEM-Engineering/iGEM-distribution/blob/c3bb1774c722634355a3c4a9049fa8abe280202c/scripts/scriptutils/part_retrieval.py#L341-L434

More specifically, a number of comparisons is being made to find out the missing elements https://github.com/iGEM-Engineering/iGEM-distribution/blob/c3bb1774c722634355a3c4a9049fa8abe280202c/scripts/scriptutils/part_retrieval.py#L405-L418

But, somehow it messes it up. In one of the "Build Package" actions log in my fork, I noticed this (bold text):

Importing parts for package 2A_peptides Package specification contains 5 parts Found 5 parts cached in package design files 0 have sequences in Excel, 1 found in directory, 4 not found 4 parts in directory are not used in package Found 4 unused parts:https://github.com/iGEM-Engineering/iGEM-distribution/2A_peptides/_111772 https://github.com/iGEM-Engineering/iGEM-distribution/2A_peptides/_52092 https://github.com/iGEM-Engineering/iGEM-distribution/2A_peptides/_111818 https://github.com/iGEM-Engineering/iGEM-distribution/2A_peptides/_111773

As I understand it, somewhere in the naming process something (a space? underscore?) lies in front of the actual sequence name and it is interpreted as an underscore during parsing, which blocks fetching from part_retrieval.py, as it is no longer the same string.

I believe sorting this out will sort the "missing sequence" problem out (as the same underscore can be found in every .gb file that is in a directory but "not found" during part retrieval process.

jakebeal commented 2 years ago

Good detective work! I can tell you where the underscores are coming from, and I think that will help guide the solution.

In SBOL, the displayID field, which is the last part of the URI, cannot be an arbitrary string. From Section 6.1 of the SBOL 3 specification:

If the displayId property is used, then its String value MUST be composed of only alphanumeric or underscore 14 characters and MUST NOT begin with a digit.

I forget the details of why this constraint exists (something about RDF infrastructure, I think), but it's why other characters get converted and why an underscore gets added to these pure-number identifiers. Presumably a similar conversion is not being done when we're making the package inventory, and thus we aren't finding the match between the numerical files and the UIDs. I suspect this might also be part of the problem in #151

I believe a good way to handle this will be to make sure that the package inventory translates the names of the items it finds into SBOL-converted names for comparison.

nickdelkis commented 2 years ago

To make sure:

We're talking about variable _inventory_part_ids_andaliases

https://github.com/iGEM-Engineering/iGEM-distribution/blob/9e388ca61a9bcdef897a860665f17fd280562fc0/scripts/scriptutils/part_retrieval.py#L409

and to translate the "wrong" inventory names in SBOL-friendly ones, we need to add the underscore in front of "forbidden" first characters, essentially digits.

How I think I'd do this is set up a list with the "forbidden characters", and check against it for each URI. If it passes, good, if not, add an underscore. In the same process, if parentheses are contained, they should be replaced with SBOL-friendly names.

jakebeal commented 2 years ago

The sbol3 library has a string_to_display_id function that you can use to do the forbidden character conversion for you, so you don't have to write it also. If you want to do the same for the full URL, the function for it is sbol-utilities.helper_functions.url_to_identity.

nickdelkis commented 2 years ago

I'm poking around on part_retrieval.py, specifically:

https://github.com/iGEM-Engineering/iGEM-distribution/blob/321ad2f22711f0ad65b20a842f75b7200d700338/scripts/scriptutils/part_retrieval.py#L387-L403

and found that

https://github.com/iGEM-Engineering/iGEM-distribution/blob/321ad2f22711f0ad65b20a842f75b7200d700338/scripts/scriptutils/part_retrieval.py#L396-L397

retrieval_uri.keys() returns values from part names instead of Data Source ID. Is this what's supposed to happen?

If so, then we should name the gb files according to part name instead of Data Source ID. This still doesn't solve the problem with the forbidden display_ids, however I cannot pinpoint when I should make the intervention and use the string_to_display_id() to make everything SBOL-friendly.

nickdelkis commented 2 years ago

Also, I do not understand the difference between self.locations and self. aliases in

https://github.com/iGEM-Engineering/iGEM-distribution/blob/321ad2f22711f0ad65b20a842f75b7200d700338/scripts/scriptutils/part_retrieval.py#L75-L80

jakebeal commented 2 years ago

I can see these do all need better documentation: please add docstrings and comments as you go!

With respect to the specific questions:

The key in retrieval_uri is an identity URL, which is constructed from the data source columns if possible and the name if the data source columns aren't filled in.
There are often multiple possible variants, however, such as J23101, BBa_J23101, and the multiple sources where it can be downloaded from. This is what the aliases list is for.
In PackageInventory, the location is the local cache file in which the part is found (if any)

nickdelkis commented 2 years ago

During the comparisons to determine which sequences are missing and which are not missing

https://github.com/iGEM-Engineering/iGEM-distribution/blob/a20f5f9b5f82bf58dea71d7450004b4074825460/scripts/scriptutils/part_retrieval.py#L405-L413

the first three variables (iterating through package_parts) contain part names instead of Data Source ID (or that's what I understand), while the inventory-derived variables contain the correct URLs, all of them and also SBOL friendly.

However, in variable both only those matching the two pass through, and this is where we "lose" all the other parts for the readmes.

Is there a way to draw the correct information from the package specification?

PS. I feel like this bug is like 15 minutes of work if you know your way around (which frustratingly I don't), and I've been on this for a week. Apologies for that.

Edit: To articulate what I propose, I feel like this line of code

https://github.com/iGEM-Engineering/iGEM-distribution/blob/a20f5f9b5f82bf58dea71d7450004b4074825460/scripts/scriptutils/part_retrieval.py#L396

should change to actually return the Data Source ID as identity and not the part name. However, I do not know how to do it.

nickdelkis commented 2 years ago

Also pushed the modified part_retrieval.py file in missing_sequence branch, so @jakebeal you will probably see all the prints in the build action!

https://github.com/iGEM-Engineering/iGEM-distribution/runs/5085967966?check_suite_focus=true#step:7:134

jakebeal commented 2 years ago

No apology needed: 1) this is the only way to get to know your way around, and 2) naming is hard and has some subtleties to it.

The key challenge that we're dealing with here is a naming "rendezvous". When we read a part in the inventory, if it's not in SBOL then we have to make some guesses about what its identity "should" be. Likewise, when we read a part from the Excel sheet, we also have to make some guesses about what its identity "should" be. In both cases, our data sources have a "canonical" version of identifiers, but also alternatives that can be used to retrieve the same part. The challenge is that we're trying to figure out the canonical version and also match together the possible names from inventory and the possible names from the sheet all at the same time. Complicating it further, we need to be tolerant enough to be easily used, but not so tolerant that we can make a mistake in a match.

Looking at your debugging printout, the first thing I noticed is that it's trying to download Addgene sequences from SynBioHub, which is clearly incorrect:

Attempting to retrieve SBOL from SynBioHub at [https://github.com:](https://github.com/) https://github.com/iGEM-Engineering/iGEM-distribution/CRISPR-Cas/ABE8e_TadA_8e_V106W

This is produced by the entry "ABE8e TadA-8e V106W" in the Excel file, which is fine except it I would have thought that it should have been "https://www.addgene.org/ABE8e_TadA_8e_V106W", given what's in the data sources sheet.

Then I notice that the name given in the corresponding file, 138495.gb is ABE8e(TadA-8e_V106W). So that seems like maybe it's not surprising that they aren't matching because one has parentheses and the other doesn't?

nickdelkis commented 2 years ago

The part you mention, has a part name without parentheses, data Source ID as Addgene ref (integer) and the discrepancy in the GB file, which contains the parentheses in the locus. However, shouldn't the Data Source be identified first since it is filled? I am missing something here. debug

Moreover, I see that when importing fastas and gb files, we use record.id for fastas and record.name for gbs. Is it just a naming convention? https://github.com/iGEM-Engineering/iGEM-distribution/blob/a20f5f9b5f82bf58dea71d7450004b4074825460/scripts/scriptutils/part_retrieval.py#L351-L372

I think I need to create a decision tree to wrap my head around it and understand which entry has priority on inventorying and in the excel, so we can document some guidelines about what should people do when adding parts (e.g. matching file name with Data Source ID, or Locus in gb files with Data Source ID, or ...)

jakebeal commented 2 years ago

You're right --- it should be able to use the 138495 as a shared key in both places, and it looks like it isn't.

record.id for fastas and record.name for gbs

I believe that for FASTA, both the record.id and record.name are identical (FASTAs have little metadata). With GB files they are not, because the ID is often given version information too. We probably really should be checking both, however.

I think a decision tree is a great idea --- writing this down carefully and putting it in the documentation will help you and me as well as anybody who comes after us. (Note also that this whole thing will eventually be migrating into SBOL-utilities too, so that it can be reused with other packages)

nickdelkis commented 2 years ago

I will try to do the decision tree in the coming days if not today.

About the issue, would it be worth to somehow include the actual file name to be considered (using an OR gate) as well as the id and name?

This way any of the given "aliases" should be accepted.

jakebeal commented 2 years ago

If the file is one that contains precisely one sequence, then yes, that would be reasonable to match. FASTA and GenBank files are often used with precisely one sequence, but can have multiple.

nickdelkis commented 2 years ago

After our discussion in the weekly software meeting about naming and the priorities of the part naming from the excel file, I tried some changes to see if what we discussed was correct:

For parts that have Data Source prefix as "Addgene", which is Literal Part --> FALSE in data_source sheet, I removed the prefix and tried to import_parts.py again. This did not solve the problem as expected, since the Locus name in the .gb file was different than the part name.

However, if we are supposed to give priority to Data Source ID as display_id, this fails when the ID given if forbidden (although it is being coverted down the line):

For example, if I try to make the SpCas9 or 39312.gb file to not have a missing_sequence by changing the locus name and the Data Source ID so they match, but they start with a digit, the software regards the correct display_id as the part name and so fails to identify the sequence.

In contrast, the PE2 or 132775.gb file has PE2 as part name, Addgene as prefix and 132775 as Data Source ID. The Locus name in the file is PE2, like the part name. This name is correctly identified from the directory and does not produce an error.

If I go to 138495.gb or ABE8_TadA-8e_V106W, which contains parentheses in the Locus Name, it is regarded as missing sequence because the Locus Name does not match the part name. When parentheses are removed and replaced with underscore. The part is not considered missing.

This tells me that what is important to avoid errors during naming (for now), is:

Provide a meaningful LOCUS name in the .gb file or fastaid, that is not SBOL-forbidden (e.g. doesn't start with a digit and does not contain parentheses)
Exactly match the LOCUS name with the Part Name. If there are spaces in the Part Name, add underscores in the LOCUS name (works fine)

I implemented these suggestions to the CRISPR-Cas package locally and all sequences are fine now (not regarded as missing)

iGEM-Engineering / iGEM-distribution

Weird undesrcore blocks part retrieval from directory during import_parts.py ? #182