iGEM-Engineering / iGEM-distribution

Repository for collective design of an iGEM DNA distribution
https://igem-distribution.readthedocs.io

Unclear warning: "part in missing sequence" #153

Closed jakebeal closed 2 years ago

jakebeal commented 2 years ago

In some places, such as the CRISPR-Cas package, the package README reports an unclear warning message, e.g.:

SpCas9 (CDS) in missing sequence

I think this means either that the plasmid vector is missing its sequence or that the script doesn't know which part of the construct is the plasmid vector. That isn't clear from the message, however, so the message should be improved.

nickdelkis commented 2 years ago

Not sure if this helps, but it more likely has to do with the variable missing_seq in generate_markdown.py (https://github.com/iGEM-Engineering/iGEM-distribution/blob/80d03b4e2532844eb981e916e2ab80a35a2ead96/scripts/scriptutils/generate_markdown.py#L47)

and with the point where it is written out while the README is generated: https://github.com/iGEM-Engineering/iGEM-distribution/blob/80d03b4e2532844eb981e916e2ab80a35a2ead96/scripts/scriptutils/generate_markdown.py#L111-L112

I will look through it myself but I need to complete the SBOL3 and pySBOL3 tutorials first to understand things.

jakebeal commented 2 years ago

@nickdelkis I think you've likely got your finger on it, and we probably want to just make that error more context sensitive. I definitely encourage you to take this on!
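
As a starting point for discussion, here is a minimal sketch of what a more context-sensitive message could look like. It is not the code in generate_markdown.py: the names MissingReason and missing_sequence_warning are hypothetical, and the three reasons are only the hypotheses raised earlier in this thread, not confirmed causes.

```python
from enum import Enum


class MissingReason(Enum):
    """Hypothetical categories, taken from the guesses in this thread."""
    PART_SEQUENCE = "the part itself has no sequence"
    VECTOR_SEQUENCE = "the plasmid vector has no sequence"
    VECTOR_UNKNOWN = "no plasmid vector could be identified for the construct"


def missing_sequence_warning(part_name: str, reason: MissingReason) -> str:
    """Build a warning that names the part and says what exactly is missing."""
    return f"{part_name}: missing sequence ({reason.value})"


if __name__ == "__main__":
    print(missing_sequence_warning("SpCas9 (CDS)", MissingReason.VECTOR_UNKNOWN))
```

Whatever wording is chosen, the idea is that the README entry should say which object lacks a sequence rather than only "in missing sequence".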

nickdelkis commented 2 years ago

@jakebeal Will do!

nickdelkis commented 2 years ago

In order to provide a context-sensitive error for the missing sequences, we need to understand why they are missing in the first place. For example, in the CRISPR-Cas collection, every part in the Excel file has a corresponding .gb file in the GitHub directory, named either with its Addgene reference number or with its generic name. All of them are CDSs, so linear parts. However, some .gb files are configured as circular. The batch that @GC-repeat added in September is reported as "missing sequences", while the PE/BE editors that @ethanj801 added a couple of weeks ago are recognized from the directory (though not all of them).

I tried adding a couple of sequences myself in my fork: a codon-optimized Cas9 for fungi and an anti-CRISPR CDS. The first is not recognized whatever I do, while the second was recognized instantly, although both were exported from Benchling with annotations made in SnapGene. The stark difference is in the size (250 bp for Acr, 4100 bp for Cas9).

So after some tinkering with the Excel file (adding/editing sequences, changing their roles/altered_sequence, suffixes (e.g. gb to genbank), circularity/linearity, uppercase/lowercase), I still cannot understand what causes some sequences to be retrieved from the directory files and others not.

I believe the answer is hidden somewhere in https://github.com/iGEM-Engineering/iGEM-distribution/blob/3cbbab40755ca02985fd6fdee9a4ee7673ff24b6/scripts/scriptutils/part_retrieval.py#L341

and

https://github.com/iGEM-Engineering/iGEM-distribution/blob/3cbbab40755ca02985fd6fdee9a4ee7673ff24b6/scripts/scriptutils/part_retrieval.py#L361-L372

I have not yet tried adding FASTA files to the directory to see whether that triggers sequence retrieval; I will do that later today or tomorrow.
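
To compare recognized and unrecognized files side by side, the properties varied above (name, size, circular/linear) can be read from the LOCUS line of each GenBank file. The following is a rough sketch using only the standard library; parse_locus and summarize_genbank_dir are hypothetical helper names, not part of part_retrieval.py, and the parser assumes the usual whitespace-separated LOCUS layout.

```python
from pathlib import Path

GENBANK_SUFFIXES = (".gb", ".gbk", ".genbank")


def parse_locus(line):
    """Return name, length and topology from a GenBank LOCUS line, or None."""
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        return None
    name = tokens[1] if len(tokens) > 1 else ""
    length = None
    if "bp" in tokens:
        idx = tokens.index("bp")
        # The sequence length is the number just before the "bp" token.
        if idx > 0 and tokens[idx - 1].isdigit():
            length = int(tokens[idx - 1])
    # Topology is optional in the LOCUS line; skip the keyword and the name.
    topology = next(
        (t for t in tokens[2:] if t in ("linear", "circular")), "unspecified"
    )
    return {"name": name, "length": length, "topology": topology}


def summarize_genbank_dir(directory):
    """List (file name, parsed LOCUS info) for each GenBank file in a directory."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in GENBANK_SUFFIXES:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            rows.append((path.name, parse_locus(handle.readline())))
    return rows


if __name__ == "__main__":
    # "." is a placeholder; point this at the package directory with the .gb files.
    for file_name, info in summarize_genbank_dir("."):
        print(file_name, info)
```

A file whose first line does not parse (info is None) or whose LOCUS name differs from the name used in the Excel file would be a natural first suspect, though that is only a guess at this point.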

jakebeal commented 2 years ago

My recent update in #171, which fixes #168, has made some of the previously obscure "missing" messages clearer in some of our packages, such as CRISPR-Cas. This doesn't change this bug, but it does indicate that the missing vector ID is important to it.