nf-core / genomeqc

Compare the quality of multiple genomes, along with their annotations.
https://nf-co.re/genomeqc
MIT License

WIP: check `agat_sp_extract_sequences.pl` as an alternative to `GFFREAD` #92

Open FernandoDuarteF opened 1 day ago

FernandoDuarteF commented 1 day ago

I was checking the BUSCO phylogenomic tree protocol, and I noticed that they use the `agat_sp_extract_sequences.pl` script from AGAT to extract the protein sequences from a filtered GFF.

Might be worth trying.

This issue is related to #37.

FernandoDuarteF commented 22 hours ago

It doesn't work on assets/samplesheet.csv: it fails with the same SEPP error as in #37. I think it's safe to say that this is a dead end.

Importantly, this problem is specific to the --auto-lineage mode and doesn't appear when the lineage is set explicitly.

I would say that the only options right now are:

FernandoDuarteF commented 21 hours ago

I also tried flattening multi-line sequences into a single line, as well as testing different sequence headers, but neither solved the issue.
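
For reference, flattening wrapped FASTA records can be sketched roughly as follows in Python. This is only an illustration of the idea, not the code that was actually run; the function name `flatten_fasta` is hypothetical and not part of the pipeline.

```python
def flatten_fasta(text: str) -> str:
    """Join wrapped sequence lines so each record is one header line plus one sequence line."""
    records = []   # list of (header, sequence) tuples in input order
    header = None  # header of the record currently being read
    chunks = []    # sequence fragments collected for the current record

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line
            chunks = []
        else:
            # Sequence lines seen before any header are discarded at the first header.
            chunks.append(line)

    if header is not None:
        records.append((header, "".join(chunks)))

    return "".join(f"{h}\n{s}\n" for h, s in records)
```

Headers are passed through unchanged, so any header variations would need to be applied separately.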

FernandoDuarteF commented 5 hours ago

I tried adding fastavalidator, but it doesn't help. It throws the same error for the outputs of both the AGAT longest-isoform step and the local Perl script:

Vespa_velutina.prot.fa.largestIsoform.fa has non-sequence characters in it

Considering it only checks for [A-Za-z], I'm pretty sure it's referring to the "*" in the protein sequences.
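
A quick way to confirm which characters trigger that message is to count everything outside `[A-Za-z]` in the sequence lines. The sketch below assumes, as stated above, that the validator only accepts `[A-Za-z]`; it is not fastavalidator's implementation, and the helper names `non_letter_chars` and `strip_stop_symbols` are hypothetical. Whether removing the `*` has any effect on the SEPP error has not been tested here.

```python
import re
from collections import Counter

# Anything that is not an ASCII letter (assumed to mirror the validator's check).
NON_LETTER = re.compile(r"[^A-Za-z]")


def non_letter_chars(fasta_text: str) -> Counter:
    """Count non-letter characters found in sequence lines (header lines are skipped)."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue
        counts.update(NON_LETTER.findall(line.strip()))
    return counts


def strip_stop_symbols(fasta_text: str) -> str:
    """Remove '*' from sequence lines only; header lines are left untouched."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)
        else:
            cleaned.append(line.replace("*", ""))
    return "".join(f"{line}\n" for line in cleaned)
```

If `non_letter_chars` reports only `*`, that supports the guess above; any other character in the result would point to a different problem in the file.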