[ ] Parallelize the parsing of the input files; there is no reason to parse them sequentially when they can be parsed in parallel. Specifically, parallelize the following snippet of `__init__()` of class `ExtractAndCollect`:
    for f in files:
        log.info(f" parsing GenBank flatfile `{f}`")
        rec = SeqIO.read(os.path.join(in_dir, f), 'genbank')
        if self.select_mode == 'cds':
            self.extract_cds(rec)
        if self.select_mode == 'igs':
            self.extract_igs(rec)
        if self.select_mode == 'int':
            self.extract_int(rec, main_odict_intron2)
            self.main_odict_nucl.update(main_odict_intron2)
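One possible shape for this, as a minimal sketch rather than a finished implementation: parse all files concurrently with `concurrent.futures`, then run the extraction step sequentially in the parent process, since `extract_cds()`, `extract_igs()` and `extract_int()` appear to modify state on `self`. The names `parse_files_in_parallel`, `parse_one` and `executor_cls` are hypothetical and not part of the existing code.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def parse_files_in_parallel(
    in_dir: str,
    files: Iterable[str],
    parse_one: Callable[[str], T],
    executor_cls: Type[Executor] = ProcessPoolExecutor,
    max_workers: Optional[int] = None,
) -> List[Tuple[str, T]]:
    """Parse every file concurrently; return (filename, parsed record) pairs in input order."""
    names = list(files)
    paths = [os.path.join(in_dir, name) for name in names]
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # so each record stays paired with the right filename.
        records = list(pool.map(parse_one, paths))
    return list(zip(names, records))


# Hypothetical use inside __init__(), assuming Biopython is installed and
# that SeqRecord objects can be pickled for transfer between processes:
#
#     from functools import partial
#     from Bio import SeqIO
#
#     parse_one = partial(SeqIO.read, format='genbank')
#     for f, rec in parse_files_in_parallel(in_dir, files, parse_one):
#         log.info(f" extracting from GenBank flatfile `{f}`")
#         if self.select_mode == 'cds':
#             self.extract_cds(rec)
```

Only the parsing is parallelized here; the extraction calls still run one after another, so the result dictionaries are filled in the same order as before. A process pool is the default because parsing is likely CPU-bound; on platforms that spawn worker processes, the calling script needs the usual `if __name__ == "__main__":` guard.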
[x] Immediately before the lines of `_extract_cds()` that extract the sequence of a feature (i.e., `feature.extract(rec).seq`), we need to start a decision tree. Here is its pseudocode:
    Q1 - Is the length of feature.extract(rec).seq a multiple of 3?
        If yes:
            Extract as-is (earlier command) or translate (later command) via `feature.extract(rec).seq.translate(table=11, cds=True)`
        If no:
            Q2 - Does the sequence start with a start codon ('ATG')?
                If yes:
                    Trim one or two nucleotides from the end of the sequence so that its length is a multiple of 3.
                    Then extract (earlier command) or translate (later command) via `feature.extract(rec).seq.translate(table=11, cds=True)`
                If no:
                    Log a warning (`log.warning`) that feature ... does not start with a start codon,
                    and skip that feature during extraction.
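The decision tree above could be sketched roughly as follows. This version works on a plain string so that it runs without Biopython; the function name `frame_adjusted_cds` and its `feature_name` parameter are hypothetical, and in `_extract_cds()` the input would presumably be `str(feature.extract(rec).seq)`.

```python
import logging
from typing import Optional

log = logging.getLogger(__name__)


def frame_adjusted_cds(seq: str, feature_name: str = "unknown") -> Optional[str]:
    """Return a sequence whose length is a multiple of 3, or None if the feature should be skipped."""
    remainder = len(seq) % 3
    if remainder == 0:
        # Q1 yes: the sequence is already in frame, use it as-is.
        return seq
    if seq.upper().startswith("ATG"):
        # Q2 yes: drop the one or two trailing nucleotides.
        return seq[: len(seq) - remainder]
    # Q2 no: warn and signal the caller to skip this feature.
    log.warning("feature `%s` does not start with a start codon; skipping", feature_name)
    return None


print(frame_adjusted_cds("ATGAAATA"))  # prints ATGAAA
```

One caveat for the translation step: with `cds=True`, Biopython's `translate()` also requires a single in-frame stop codon at the end of the sequence, so a sequence trimmed in this way may still raise a `TranslationError` and might need to be translated without `cds=True`.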