nf-core / nanoseq

Nanopore demultiplexing, QC and alignment pipeline
https://nf-co.re/nanoseq
MIT License
181 stars 81 forks source link

Total alignments : 0 / featureCount Results #115

Open cagataykos opened 3 years ago

cagataykos commented 3 years ago

I have human RNA-Seq dataset it has two different barcodes in the different folder. I aligned with that command minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam I try to quantify and counts using with Subread featureCounts function. In the subread results, there is a problem with one of the bam files. I downloaded reference and gtf files from GENCODE. I checked the bam file with samtools view -H first.bam-second.bam I saw that I followed the same steps for each bam file. In the IGV results, I saw matches and alignment for all bam files.

Do you have any suggestions the solve this problem? What am I doing wrong?

featureCounts -T 8 -a gencode.v38.chr_patch_hapl_scaff.annotation.gtf -g 'transcript_id' -o readcouts.txt bam/*.bam

|| Total alignments : 11214480 || || Successfully assigned alignments : 4051945 (36.1%) || || Running time : 2.67 minutes

|| Total alignments : 0 || || Successfully assigned alignments : 0 || || Running time : 2.89 minutes

I also tried with Salmon in the salmon alignment-based quantification results bam file has huge differences between each other. salmon quant --ont -t reference.fa -l A -a first.bam -o salmon_quant1 Total # of mapped reads : 5465357

of uniquely mapped reads : 328808350000000

ambiguously mapped reads : 2177274 salmon quant --ont -t reference.fa -l A -a second.bam -o salmon_quant2 Completed first pass through the alignment file.

Total # of mapped reads : 3843632

of uniquely mapped reads : 2552463

ambiguously mapped reads : 1291169

casmerifle commented 9 months ago

Apologies for the necro but I've encountered a similar issue and would like to find out the issue. I ran nanoseq using a custom fasta and gtf, opting for stringtie2 to perform transcript quantification. Like @cagataykos, I've had transcripts identified (that is, a subset of the unique transcripts in my gtf) but counts of 0 for all of them.

To troubleshoot, I've verified that: (i) both chromosome naming styles ("chr1" v. "1") result in the same output, (ii) running stringtie2 independently using the intermediate minimap2 output of nanoseq results in the error:

WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.

I was considering the possibility that the reference used for minimap2 may not be compatible with the reference used in stringtie2. However, given I supplied to the parameters csv a unique transcriptome fa and gtf, and that transcript identification was still possible by stringtie2 as part of the nanoseq pipeline, I'm not certain how to proceed further.

The csv file I used is as follows:

group,replicate,barcode,input_file,fasta,gtf
HUMAN_FTO_WT_DMSO_M,1,,/home/user/data/nanoseq/raw/PAQ63378_FTO_WT_DMSO_M/PAQ63378_FTO_WT_DMSO_M_basecalled.fastq.gz,/home/user/data/nanoseq/gencode.v44.transcripts.fa,/home/user/data/nanoseq/gencode.v44.annotation.ChrPresent.gtf
HUMAN_FTO_WT_BT2,1,,/home/user/data/nanoseq/raw/PAQ87300_FTO_WT_BT2/PAQ87300_FTO_WT_BT2_basecalled.fastq.gz,/home/user/data/nanoseq/gencode.v44.transcripts.fa,/home/user/data/nanoseq/gencode.v44.annotation.ChrPresent.gtf

Executed this command: nextflow run nf-core/nanoseq --input /home/samuel/data/nanoseq/raw/nanoseq_pars.csv --protocol directRNA --skip_demultiplexing --skip_fusion_analysis -profile singularity

Based on this, it seems like there's an inherent incompatibility with my ref files (genocode) and what stringtie2 is able to process?

For reference, the ref gtf I created (version where naming is "1" not "Chr1"):

1       HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
1       HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1       HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1       HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1       HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

Would there be an issue with the gencode source and Ensembl content?

Appreciate any help.