Closed samuelortion closed 9 months ago
I thanks for reporting this, we will have a look at it as soon as we can. @FerriolCalvet
Sorry for the late reply and thank you @UncleSamulus for opening the issue. (thank you also @emiliorighi)
The way we developed Geneidx was thinking on the annotation of entire genomes so that we could take advantage of the identification of lots of protein-to-DNA alignments to infer some of the parameter files needed for running the gene prediction step and also use them as hints. Since you are using a quite small genomic sequence, there might not be introns or maybe there are not enough protein-DNA alignments that pass the quality filters and then can be used to define the parameters.
I am not familiar with the sequence you are using but since the tool is quite fast to run I would probably run the entire genome and then look into this sequence to see which genes are annotated.
Running the entire genome is also better because it should reduce the amount of missaligned proteins, since with a smaller sequence maybe some proteins' best match is not the same one as if the entire genome was there.
Right now I cannot think of any alternative apart from running the entire genome, but we working on updating our tool so that if the user provides an input where it cannot take advantage of the autotraining capabilities, geneidx will provide an annotation of the provided sequence using only geneid (https://github.com/guigolab/geneid) and the closest pre-trained parameter file.
Let me know if you have more doubts!
Thanks!
Ferriol
Thank you for your reply. This sequence comes from a genome of a variety of wheat. I am not sure whether I could reasonably run your pipeline on the whole wheat genome, as I do not have access to any HPC platform :thinking:.
I'll take a look on geneid. Maybe a subset of the wheat genome would be enough to train its parameters, while being tractable by my machine.
Yes, maybe try running only 5 or 6 main chromosomes (sorry I don't know how big/small is this genome) and it will probably work better.
@UncleSamulus I am closing this issue, feel free to open another one if you encounter other problems
Hello, I wanted to try your pipeline on a small Wheat sequence, but I faced several issues, starting from
pyComputeIntrons
process. Becauseintron_l
is empty (I posit it's because I do not have introns in this little sequence?), the pandas indexing returns a Keyerror ondd_intron = dd_intron[[0,4,1,3]]
. I fixed this in a dirty way by specifyingcolumns=range(5)
in the previous pd.DataFrame call.Then,
findORF
fails, because<name>_gffread.fa
is empty (so not a fasta file):I am not sure what I could do to avoid this issue.
For reference, the sequence I used is available on https://framagit.org/geniomhe/genomics/structural-genomics-project-r4-2023/-/blob/main/data/region4.fasta, and I used the following command:
on the following TSV: