jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0
69 stars 9 forks source link

[Bug] featureCounts error in binning-eukaryotic.py | ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. #47

Closed javiercnav closed 5 months ago

javiercnav commented 5 months ago

Describe the bug: ERROR: Subread failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id' An example of attributes included in your GTF annotation is 'ID=k127_9079703_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.446;conf=98.06;score=17.06;cscore=14.96;sscore=2.10;rscore=-1.19;uscore=-0.92;tscore=4.21;contig_id=k127_9079703;gene_id=k127_9079703_1;gene_biotype=protein_coding;

See the file gene_models.eukaryotic.gff

gene_models eukaryotic gff

Now see the log file for subread indicating that -g should be 'gene_id'

subread_veba

Versions

(base) conda activate VEBA-binning-eukaryotic_env
(VEBA-binning-eukaryotic_env) binning-eukaryotic.py -v
binning-eukaryotic.py v2023.10.16

Command used to produce error: 4featurecounts cat binning_euks/mendo_euks/intermediate/3busco/filtered/genomes/*.gff > binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff && mkdir -p binning_euks/mendo_euks/tmp/featurecounts && ( /projects/navarro_lab/envs/VEBA-binning-eukaryotic_env/bin/featureCounts -G mendoG_euks.fasta -a binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff -o binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir binning_euks/mendo_euks/tmp/featurecounts -T 30 -g gene_id -t CDS -p --countReadPairs T111_all_s.bam T121_all_s.bam T141_all_s.bam T311_all_s.bam T321_all_s.bam T331_all_s.bam T411_all_s.bam T421_all_s.bam T431_all_s.bam ) && gzip -f binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv && rm -rf binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff

Please provide the following files: 4__featurecounts.returncode.txt 4__featurecounts.o.txt 4__featurecounts.e.txt

The input files are too large to upload.

Thanks a lot in advance!

jolespin commented 5 months ago

I haven't seen this error before but I'll take a look at it so I can get it resolved before the next update. Can you upload your GFF file?

In the meantime, can you try running this to see if you get the same error?

source activate VEBA-binning-eukaryotic_env

BAM="T111_all_s.bam T121_all_s.bam T141_all_s.bam T311_all_s.bam T321_all_s.bam T331_all_s.bam T411_all_s.bam T421_all_s.bam T431_all_s.bam"

featureCounts --verbose -G mendoG_euks.fasta -a binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff -o binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir binning_euks/mendo_euks/tmp/featurecounts -T 30 -g gene_id -t CDS -p --countReadPairs ${BAM}

I added the verbose flag so maybe that will indicate which record is throwing the error.

javiercnav commented 5 months ago

Thanks a lot for the quick response! I run the command you provided and got the following:

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id' An example of attributes included in your GTF annotation is 'ID=k127_9079703_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.446;conf=98.06;score=17.06;cscore=14.96;sscore=2.10;rscore=-1.19;uscore=-0.92;tscore=4.21;contig_id=k127_9079703;gene_id=k127_9079703_1;gene_biotype=protein_coding;;organelle=mitochondrion;'.

The gff file can be downloaded from the following link: https://www.dropbox.com/scl/fi/svhuk6ha1lilbfkl84dyl/gene_models.eukaryotic.gff.txt?rlkey=4cokljnhn01kp69wicmstg4tk&dl=1

javiercnav commented 5 months ago

I tested a second dataset, also for euks, and got the same error... I am wondering if there is a wrongly set variable in my installation in the environment for euks...

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file. The specified gene identifier attribute is 'gene_id'

I used the same dataset and same bam files to bin prokaryotes and viruses and all seems fine there.

jolespin commented 5 months ago

Hmm...ok let's try this.


source activate VEBA-binning-eukaryotic_env

CDS_GFF="gene_models.eukaryotic.CDS.gff"
cat binning_euks/mendo_euks/tmp/gene_models.eukaryotic.gff | grep "MetaEuk" | grep "CDS" > ${CDS_GFF}

BAM="T111_all_s.bam T121_all_s.bam T141_all_s.bam T311_all_s.bam T321_all_s.bam T331_all_s.bam T411_all_s.bam T421_all_s.bam T431_all_s.bam"

featureCounts --verbose -G mendoG_euks.fasta -a ${CDS_GFF} -o binning_euks/mendo_euks/intermediate/4__featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir binning_euks/mendo_euks/tmp/featurecounts -T 30 -g gene_id -t CDS -p --countReadPairs ${BAM}

If that doesn't work then I have a couple of options:

  1. I might need you assembly file and 1 or 2 BAM files to reproduce. Then I can test on these files directly to find the culprit. You could always e-mail them to me separately if you would feel more comfortable.
  2. You can try running a test dataset that I know works:

Try this command if you do:

CMD="source activate VEBA && veba --module binning-eukaryotic --params=\"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -a metabat2\""

Here's the featurecounts step:

# Program:featureCounts v2.0.3; Command:"/expanse/projects/jcl110/miniconda3/envs/VEBA-binning-eukaryotic_env/bin/featureCounts" "-G" "veba_output/binning/prokaryotic/S1/output/unbinned.fasta" "-a" "veba_output/binning/eukaryotic/S1/tmp/gene_models.eukaryotic.gff" "-o" "veba_output/binning/eukaryotic/S1/intermediate/4__featurecounts/featurecounts.orfs.tsv" "-F" "GTF" "--tmpDir" "veba_output/binning/eukaryotic/S1/tmp/featurecounts" "-T" "1" "-g" "gene_id" "-t" "CDS" "-p" "--countReadPairs" "veba_output/assembly/S1/output/mapped.sorted.bam"

S1_unbinned_for_eukaryotic.fasta.gz S1_1.fastq.gz S1_2.fastq.gz S1_mapped.sorted.bam

  1. I plan on updating the repository with 1.5.0 in a few days if you want to wait for that and try again but I have a feeling you'll get the same error because I haven't touched the featureCounts part of the euk binning module.
javiercnav commented 5 months ago

Hello there; The VEBA version is 1.2.0 The code you shared for my dataset did the job. Feature counts were generated after subsetting CDS & Metaeuk. Now, I will try to figure out what happened with the formatting of my gff files. I also ran the binning_eukaryotic using the files you provided, and the run finished without any issues. Thanks a lot!

jolespin commented 5 months ago

No problem! Glad we got this resolved and thanks again for bringing this issue to my attention. At least I know that filtering for just the CDS solves the issue. I'll keep my eye out for this for future versions. Will be an easy fix to just grep CDS and MetaEuk into a tmp GFF and use that if necessary.