List of things to do - High Priority

[ ] Write and include code that can effectively handle the exceptions encountered when running the INT benchmarking data set:
```
[WARNING] Unable to extract gene name for CDS starting at `9876` of `NC_123456`. Skipping feature ...
Error message: 'gene'
```
Explanation: Normally, every gene annotation in a GenBank flatfile must be accompanied by a corresponding CDS annotation. Often, only the CDS annotation contains the actual gene name. Unfortunately, there are plenty of GenBank flatfiles out there, where gene annotations without their CDS annotations are given. Such annotations (i.e., gene annotation only without accompanying CDS annotation) should be ignored by PlastomeBurstAndAlign and not even receive a warning.

[ ] Write and include code that can effectively handle the extraction of intergenic spacers adjacent to rps12 (see when running the IGS benchmarking dataset):


2024-06-12 21:41:30 [WARNING] NC_123456: the IGS between `rps12` and `trnH-GUG` is currently not handled and would have to be extracted manually. Skipping this IGS ...

2024-06-12 21:41:30 [WARNING] NC_123456: the IGS between rpl20 and rps12 is currently not handled and would have to be extracted manually. Skipping this IGS ...

2024-06-12 21:41:30 [WARNING] NC_123456: the IGS between rps12 and clpP is currently not handled and would have to be extracted manually. Skipping this IGS ...

**Explanation:** The gene _rps12_ is the only gene in the plastid genome that is split into three different exons, one of which is located far away from the other two (i.e., half-way across the genome). From a paper: "_The first exon of the rps12 gene is generally located in the large single-copy (LSC) region, whereas the second and third exons reside in the inverted repeats (IRs)_" The typical extraction process of intergenic spacers in which two adjacent genes are taken and the region between them extracted as "intergenic spacer" (hence, the name) does not work in this case. Instead, _rps12_ will need to be treated as two different genes when intergenic spacers where _rps12_ is one of the flanking genes are extracted. Until such a special extraction routine for all IGS, which _rps12_ is involved in as a flanking gene (typically  `rpl20--rps12` and `rps12--clpP` for exon 1 and `rps7--rps12` and `rps12--trnV-GAC` for exons 2 and 3), is developed, the above warning is printed.

---

- [ ] Update the package to handle the exceptions encountered when running the CDS benchmarking**2** data set:

2024-06-18 21:44:22 [WARNING] Unable to convert alignment of atpF from FASTA to NEXUS. Error message: [Errno 2] No such file or directory: 'benchmarking2/output_CDS/nucl_atpF.aligned.fasta' 2024-06-18 21:44:22 [WARNING] Unable to convert alignment of rps2 from FASTA to NEXUS. Error message: [Errno 2] No such file or directory: 'benchmarking2/output_CDS/nucl_rps2.aligned.fasta' 2024-06-18 21:44:35 [WARNING] Unable to convert alignment of rpoC1 from FASTA to NEXUS. ...

**Explanation:** It must have something to do with the immediately preceding warnings. For example:

2024-06-18 21:44:12 [WARNING] Inconsistent lengths for ndhF_MG925366, ungapped protein 309, tripled 927 vs ungapped nucleotide 975. Protein: MRKYPSKVSKYILNINAVNPVVNQAISAKIGEYNQLSRISSLDQKQARGGIPQRESVPKK ...............!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!! Translation: MRKYPSKVSKYILNIXNAVNPVVNQAISAKIGEYNQLSXRISSLDQKQARGGIPQRESVP

Protein: KVVFVIGTYFVKPPIRTIFLFSGEYPTIGSIEIMDPDPKNNKALEAWVIKNKAAREPIPR .!!!.!!!!!!!!!!!!!!!.!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Translation: KKKVVFVIGTYFVKPPIRTIFXLFSGEYPTIGSIEXIMDPDPKNNKALEXAWVIKXNKAA

Protein: ANIIPNDIVEAKLFLMSLARAKVAPKSTVIIPTKQMRFIMGITMKRGRSRATRKIPAATI !!!!!!!!!!!!!!!!!!!!..!!!!.!!!!!!!!!!!!!!.!!.!!.!!!!!!!!!!!! Translation: RXEPIPRANIIXPNXDIVEXAKLFLMSLXARAKVAPKSTVIIPTKQMRFIMXGITMKRGR

Protein: VAACIRAEIGVGPSMASGNHTRGNCADLAIAPTNNKKAHKVANKELTPLLWIKVFTIWNK !!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!.!!!!!!!!!!..!!!!!! Translation: SRATRKIPAATIVAACIRAEIGVGPSMASGNHTXRGNCADLAIAPTNNKKAHKVANKELT

Protein: SRNSKLPVIQNPKVPNNKPKSPIRLVTKAFQALAATGRVNQKPINKYEHIPTSSQKIICI !!!!!!!!!!!!.!!.!.!!!!!!!!!!!!!!!!!!!.!!!.!!!!!!!!!!!!!!!!!! Translation: PLLWIKVFTIWNKSRNSKLPVIQXNPKVPNNKPKSPIRLVTKAFXQALAATGRVNQKPIN

Protein: KLELVTNPN .!.!!!!!! Translation: KYEHIPTSSQKIXICIKLELVTNPN



---

- [ ] Refactor the code so that a flow chart of the software processes can be generated more easily. ([Here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6328100/figure/pone.0210347.g003/) is an example of a flowchart that is reasonable.)
  - For example, keep function names short so that they can be spelled out in the flow chart.

About Item 2, the part about splitting rps12 into features with simple locations is pretty straightforward. However, how to deal with their adjacency to other genes is less obvious to someone with minimal background in intergenic spacers. Intuitively it would seem to make sense to move these new features into their appropriate new location to be properly adjacent. This appears to be the correct approach, but other times its original placement in the genes list makes more sense. For an example, let us consider NC_034323.

The first gene is rps12, with the relevant feature data for the extraction iteration being

rps12
join{[69202:69316](-), [97404:97647](-)}

trnH-GUG
[4:79](-)

From your discussion about this issue, this adjacency appears correct but it is not clear which of the locations should be considered adjacent to trnH-GUG. Further, if the proper adjacent location can be determined, it feels odd that such distant genes would be considered adjacent.

The occurrence of rps12 on the other strand appears to make more sense in regard to adjacency. The feature locations look like

rpl20
[68085:68466](-)

rps12
join{[69202:69316](+), [137901:138144](+)}

clpP
[69465:71495](-)

The first location of the gene properly slots in between the other two, and corresponds to the other common occurrences of rps12 that you mention.

The two unused rps12 locations could be inserted between

rps7
[96333:96801](-)

trnV-GAC
[99493:99565](+)

and

trnV-GAC
[135983:136055](-)

rps7
[138747:139215](+)

but it is unclear if this is a proper course of action.

In summary, it seems that there are two possible approaches, which differ based on the interpretation of the item. The first is much simpler, which assumes that rps12 is in the correct location in the list of genes. In this case, both simple location features would be considered adjacent to the preceding and succeeding genes in the list. The second approach also produces simple locations, but instead, moves them to the correct position in the list so that they are adjacent by location. There is possibly a hybrid of these two approaches, but that would probably require more explanation on your part.

NC_034323

The above displays the plastid genome map of plastome record NC_034323. The positions of the three exons of rps12 are highlighted by red arrows. Please note that exons 2 and 3 occur twice because many (but not all) plastid genomes have inverted repeats (i.e., the inverted duplicated section at the end of the genome that is called "IRa"). Highlighted in blue is the counter-clockwise direction in which a plastid genome is structured, with trnH-GUG typically begin the very first gene. The IRa is not included in the blue direction indicator.

The fact that rps12 is the first gene listed is a technical artifact by NCBI, who always list this gene first due to its unique characteristic of being located in different sections of the genome (see its GenBank record). rps12 is by no means the first gene in the genome. Hence, it is biologically impossible that rps12 and trnH-GUG have an intergenic spacer (IGS) between them. Rather, the IGS that rps12 is involved as a flanking gene in are:

rpl20--rps12 and rps12--clpP for exon 1
rps7--rps12 and rps12--trnV-GAC for exons 2 and 3 (I have corrected this fact in the post above as well.) However, these IGS cannot be hard-coded, as some genes are occasionally lost during plant evolution.

A proper functioning algorithm would proceed in the counter-clockwise reading direction as displayed above and first extract IGS rpl20--rps12 (68467--69201), then IGS rps12--clpP (69317--69464), then rps7--rps12 (96802--97404), and finally rps12--trnV-GAC (97648--99492).

Interestingly, the way rps12 looks for you (i.e., rps12 join{[69202:69316](+), [137901:138144](+)}) is only the partially duplicated copy involving IRa, not the actual gene (whose exons 2 and 3 are located in IRb), which would be better to use. The IRb version of rps12 would look something like the following and may be among the extracted annotations somewhere: rps12 join{[69202:69316](+), [97405:97647](?)}

Hence, the first thing to do is probably to sort the annotations by their exact genomic position instead of replying on the order provided through the input file. While that order is generally correct, it is not for rps12.

The next step may be to remove that annotation of rps12 that has exons 2 and 3 in IRb and instead only keep the version that has exons 2 and 3 in IRa - assuming both annotations are present. Note, however, that long ago I had implemented a function that was supposed to delete all duplicate annotations (to remove all annotations in IRb); if that function is not working properly, it may have removed the IRa version of rps12, which would need to be corrected.

For Item 3, does this refer to a script (either new or updated) that has not yet been pushed to the repo? I am unable to reproduce this with test_script_igs.py. If this is indeed the proper script, this might be a difference between Linux and Windows that I can look into.

It only occurs for benchmark dataset 2, not for benchmark dataset 1, which is the default one listed in test_script_igs.py. Hence, you would need to mass-replace "benchmarking1" with "benchmarking2" in that file and then run it to achieve the error messages. Hence, the underlying problem of these errors has something to do with the exact GenBank flatfiles that are in benchmark dataset 2.

michaelgruenstaeudl / PlastomeBurstAndAlign

List of things to do - High Priority #18