guigolab / geneidx

Nextflow pipeline for genome annotation of protein-coding genes
GNU General Public License v3.0
16 stars 2 forks source link

Failing on sequence without introns? #6

Closed samuelortion closed 9 months ago

samuelortion commented 9 months ago

Hello, I wanted to try your pipeline on a small Wheat sequence, but I faced several issues, starting from pyComputeIntrons process. Because intron_l is empty (I posit it's because I do not have introns in this little sequence?), the pandas indexing returns a Keyerror on dd_intron = dd_intron[[0,4,1,3]]. I fixed this in a dirty way by specifying columns=range(5) in the previous pd.DataFrame call.

Then, findORF fails, because <name>_gffread.fa is empty (so not a fasta file):

RuntimeError: <name>_gffread.fa is not fasta or fastq sequence file

I am not sure what I could do to avoid this issue.

For reference, the sequence I used is available on https://framagit.org/geniomhe/genomics/structural-genomics-project-r4-2023/-/blob/main/data/region4.fasta, and I used the following command:

nextflow run geneidx/main.nf -profile docker \
    --tsv ./data/geneidx.tsv \
    --column_taxid_value TAXID \
    --column_path_value PATH \
    --column_id_value ID \
    --outdir results/geneidx

on the following TSV:

ID  PATH    TAXID
region4 data/region4.fasta.gz   4565
emiliorighi commented 9 months ago

I thanks for reporting this, we will have a look at it as soon as we can. @FerriolCalvet

FerriolCalvet commented 9 months ago

Sorry for the late reply and thank you @UncleSamulus for opening the issue. (thank you also @emiliorighi)

The way we developed Geneidx was thinking on the annotation of entire genomes so that we could take advantage of the identification of lots of protein-to-DNA alignments to infer some of the parameter files needed for running the gene prediction step and also use them as hints. Since you are using a quite small genomic sequence, there might not be introns or maybe there are not enough protein-DNA alignments that pass the quality filters and then can be used to define the parameters.

I am not familiar with the sequence you are using but since the tool is quite fast to run I would probably run the entire genome and then look into this sequence to see which genes are annotated.

Running the entire genome is also better because it should reduce the amount of missaligned proteins, since with a smaller sequence maybe some proteins' best match is not the same one as if the entire genome was there.

Right now I cannot think of any alternative apart from running the entire genome, but we working on updating our tool so that if the user provides an input where it cannot take advantage of the autotraining capabilities, geneidx will provide an annotation of the provided sequence using only geneid (https://github.com/guigolab/geneid) and the closest pre-trained parameter file.

Let me know if you have more doubts!

Thanks!

Ferriol

samuelortion commented 9 months ago

Thank you for your reply. This sequence comes from a genome of a variety of wheat. I am not sure whether I could reasonably run your pipeline on the whole wheat genome, as I do not have access to any HPC platform :thinking:.

I'll take a look on geneid. Maybe a subset of the wheat genome would be enough to train its parameters, while being tractable by my machine.

FerriolCalvet commented 9 months ago

Yes, maybe try running only 5 or 6 main chromosomes (sorry I don't know how big/small is this genome) and it will probably work better.

emiliorighi commented 9 months ago

@UncleSamulus I am closing this issue, feel free to open another one if you encounter other problems