Preemptive handling of known raised exceptions & compound location handling for intergenic features

alephnull7 commented 1 week ago

Changes directly addressing listed issues:

In intron extraction, the code now checks that "gene" is present in a feature's qualifiers before attempting the extraction, which prevents the unnecessary exception reported as Error message: 'gene'.
When performing back-translation, back-translated sequences are only added to the aligned list if the process was successful (not None). Within the translation of individual sequences, the assertions have been replaced with None returns, and None is returned if the nucleotide evaluation was unsuccessful (None was returned). This allows all successful back-translations to be saved to file, instead of an issue with any individual sequence raising an exception and preventing the saving of a MultipleSeqAlignment object to file. Now, the No such file or directory error should only occur if no translations were successful.
The first iteration of a procedure to resolve genes with compound locations has been implemented. Below are the common compound location gene patterns I've noticed, and how they are currently being accounted for.
- Within a single genome, the most common compound location gene is rps12, usually consisting of two or four total locations. In the former case, the two simple location genes are inserted into the proper positions in the genes list; in the latter, three genes are inserted into the list, as the annotation in IRb is duplicated. Later, when adding these items to the dictionary, only the first encountered rps12 annotation in IRb and in IRa will be added.
- Occasionally, there will be an rps12 feature with more than two simple locations. I have seen three causes: a subset of the location list consists of duplicate annotations for the same gene, with minor differences in location; exons 2 and 3 are annotated separately; or the exons are annotated separately and an additional annotation treats them as contiguous. Combinations of these cases have also been observed. Right now, to account for these variations, an rps12 annotation is inserted into the desired position of the genes list if it begins after the previous gene and ends before the succeeding gene. If the annotation does not meet these criteria and overlaps with an adjacent gene, it replaces that gene only if both are annotations of the same gene and the new annotation is longer. The aim is to keep the longest available annotation of a gene at a given location in the sequence. Other additions, such as merging overlapping annotations of the same gene, have been considered, but before going ahead with that I wanted to receive feedback about the current approach.
- Other compound location genes show the same features as rps12 with regard to duplicate annotations and split sequences, and are handled in the same way.
Method names have been shortened while attempting to preserve or improve their meaning.
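
To make the insert-or-replace rule for compound location annotations easier to review, here is a minimal sketch of how I read it. It is not the code in this PR: Annotation and place are hypothetical names, plain integers stand in for the Biopython location objects, and "begins after the previous gene" is assumed to mean "starts at or after the end of the previous gene".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """Hypothetical stand-in for one simple-location part of a gene feature."""
    gene: str
    start: int
    end: int  # exclusive end coordinate

    @property
    def length(self) -> int:
        return self.end - self.start


def place(genes: List[Annotation], new: Annotation) -> bool:
    """Insert `new` into the start-sorted `genes` list, or substitute it for
    a shorter overlapping annotation of the same gene.

    Returns True if the list was changed, False if `new` was discarded.
    """
    # Index of the first gene that starts at or after the new annotation.
    idx = next((i for i, g in enumerate(genes) if g.start >= new.start), len(genes))
    prev_gene = genes[idx - 1] if idx > 0 else None
    next_gene = genes[idx] if idx < len(genes) else None

    # Case 1: the annotation fits cleanly between its two neighbours.
    fits_after_prev = prev_gene is None or new.start >= prev_gene.end
    fits_before_next = next_gene is None or new.end <= next_gene.start
    if fits_after_prev and fits_before_next:
        genes.insert(idx, new)
        return True

    # Case 2: it overlaps a neighbour; replace that neighbour only if it
    # annotates the same gene and the new annotation is longer.
    for i in (idx - 1, idx):
        if 0 <= i < len(genes):
            g = genes[i]
            overlaps = new.start < g.end and g.start < new.end
            if overlaps and g.gene == new.gene and new.length > g.length:
                genes[i] = new
                return True
    return False
```

With a list holding psbA (0 to 100), rps12 (300 to 350), and ndhB (500 to 600), an rps12 part at 150 to 200 is inserted, a longer rps12 part at 290 to 400 replaces the existing 300 to 350 one, and an rps12 part at 480 to 520 is discarded because it overlaps a different gene. The sketch does not re-check the other neighbour after a replacement, which is one of the places where merging overlapping annotations might later be needed.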

Other notable changes:

The handling of feature extraction has been moved into the "Feature" classes, and the adding of features to the nucleotide/protein dictionaries has been moved into PlastidData methods.
The constructor for BackTranslation has been updated to set member variables that are used throughout the back-translation process and the methods have been updated to reflect this.
Instead of updating the nucleotide dictionary with the auxiliary intron dictionary at the end of the intron extraction, these introns are directly added to the nucleotide dictionary during extraction. The update method is destructive: if a key exists in both dictionaries, the value assigned to that key is replaced by the value in the other dictionary. Instead, I believe the intention was to add new entries to each key's list, which is what is now occurring. Edit: I reexamined this, and the dictionaries would not share any keys, so the previous implementation would work as expected. Instead, the new approach is just more streamlined.
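
For reference, the difference between the two behaviours can be shown with plain dictionaries of lists. This is only an illustration under assumed names (add_record and the sample keys and values are hypothetical), not the PlastidData code itself, and it deliberately uses a shared key even though the real dictionaries are not expected to share any.

```python
from typing import Dict, List


def add_record(store: Dict[str, List[str]], key: str, record: str) -> None:
    """Append `record` to the list stored under `key`, creating the list if needed."""
    store.setdefault(key, []).append(record)


# Hypothetical data: both dictionaries hold a list under the same key.
main = {"rps16": ["record_from_sample_A"]}
aux = {"rps16": ["record_from_sample_B"]}

# dict.update() replaces the whole list stored under a shared key.
overwritten = dict(main)
overwritten.update(aux)

# Appending record by record keeps the entries from both dictionaries.
appended = {key: list(records) for key, records in main.items()}
for key, records in aux.items():
    for record in records:
        add_record(appended, key, record)
```

After this runs, overwritten holds only the record from sample B under "rps16", while appended holds both records in insertion order.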

michaelgruenstaeudl commented 1 week ago

> When performing back-translation, back-translated sequences are only added to the aligned list if the process was successful (not None). Within the translation of individual sequences, the assertions have been replaced with None returns, and None is returned if the nucleotide evaluation was unsuccessful (None was returned). This allows all successful back-translations to be saved to file, instead of an issue with any individual sequence raising an exception and preventing the saving of a MultipleSeqAlignment object to file. Now, the No such file or directory error should only occur if no translations were successful.

Yes, good catch! That is an important improvement indeed.
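
The pattern being discussed (return None instead of asserting, then keep only the successful results) could look roughly like the sketch below. It is a simplified illustration, not the BackTranslation class: back_translate and back_translate_all are hypothetical names, sequences are plain strings rather than Biopython records, and the only validation assumed here is that the nucleotide length equals three times the number of ungapped residues.

```python
from typing import Dict, List, Optional, Tuple


def back_translate(aligned_protein: str, nucleotide: str, gap: str = "-") -> Optional[str]:
    """Thread the codons of `nucleotide` onto a gapped protein sequence.

    Returns None (instead of raising) when the lengths do not correspond.
    """
    residues = aligned_protein.replace(gap, "")
    if len(nucleotide) != 3 * len(residues):
        return None
    codons = iter(nucleotide[i:i + 3] for i in range(0, len(nucleotide), 3))
    # Each gap becomes three gap characters; each residue consumes one codon.
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


def back_translate_all(
    aligned_proteins: Dict[str, str], nucleotides: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Collect only the back-translations that succeeded."""
    aligned = []
    for name, protein in aligned_proteins.items():
        nucleotide = nucleotides.get(name)
        result = back_translate(protein, nucleotide) if nucleotide is not None else None
        if result is not None:  # skip failed sequences instead of aborting
            aligned.append((name, result))
    return aligned
```

In this form, one sequence with a missing or mismatched nucleotide record no longer prevents the remaining sequences from being collected, and the caller only ends up with nothing to write if the returned list is empty.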

michaelgruenstaeudl commented 1 week ago

> Instead, I believe the intention was to add new entries to each key's list, which is what is now occurring. Edit: I reexamined this, and the dictionaries would not share any keys, so the previous implementation would work as expected. Instead, the new approach is just more streamlined.

Good catch too! While it would be highly unlikely that a CDS key would have the same name as an intron key (and thus overwrite it), or vice versa, appending to each key's list instead of calling dict.update() is a safer approach in any event.

michaelgruenstaeudl commented 1 week ago

Thank you for moving the code regarding the feature extraction of genes, introns, and intergenic spacers into their own classes (i.e., GeneFeature, IntronFeature, IntergenicFeature). The code is much cleaner because of it.

michaelgruenstaeudl / PlastomeBurstAndAlign

Preemptive handling of known raised exceptions & compound location handling for intergenic features #20