Open cagataykos opened 3 years ago
Apologies for the necro but I've encountered a similar issue and would like to find out the issue. I ran nanoseq using a custom fasta and gtf, opting for stringtie2 to perform transcript quantification. Like @cagataykos, I've had transcripts identified (that is, a subset of the unique transcripts in my gtf) but counts of 0 for all of them.
To troubleshoot, I've verified that: (i) both chromosome naming styles ("chr1" v. "1") result in the same output, (ii) running stringtie2 independently using the intermediate minimap2 output of nanoseq results in the error:
WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
Please make sure the -G annotation file uses the same naming convention for the genome sequences.
I was considering the possibility that the reference used for minimap2 may not be compatible with the reference used in stringtie2. However, given I supplied to the parameters csv a unique transcriptome fa and gtf, and that transcript identification was still possible by stringtie2 as part of the nanoseq pipeline, I'm not certain how to proceed further.
The csv file I used is as follows:
group,replicate,barcode,input_file,fasta,gtf
HUMAN_FTO_WT_DMSO_M,1,,/home/user/data/nanoseq/raw/PAQ63378_FTO_WT_DMSO_M/PAQ63378_FTO_WT_DMSO_M_basecalled.fastq.gz,/home/user/data/nanoseq/gencode.v44.transcripts.fa,/home/user/data/nanoseq/gencode.v44.annotation.ChrPresent.gtf
HUMAN_FTO_WT_BT2,1,,/home/user/data/nanoseq/raw/PAQ87300_FTO_WT_BT2/PAQ87300_FTO_WT_BT2_basecalled.fastq.gz,/home/user/data/nanoseq/gencode.v44.transcripts.fa,/home/user/data/nanoseq/gencode.v44.annotation.ChrPresent.gtf
Executed this command:
nextflow run nf-core/nanoseq --input /home/samuel/data/nanoseq/raw/nanoseq_pars.csv --protocol directRNA --skip_demultiplexing --skip_fusion_analysis -profile singularity
Based on this, it seems like there's an inherent incompatibility with my ref files (genocode) and what stringtie2 is able to process?
For reference, the ref gtf I created (version where naming is "1" not "Chr1"):
1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
Would there be an issue with the gencode source and Ensembl content?
Appreciate any help.
I have human RNA-Seq dataset it has two different barcodes in the different folder. I aligned with that command minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam I try to quantify and counts using with Subread featureCounts function. In the subread results, there is a problem with one of the bam files. I downloaded reference and gtf files from GENCODE. I checked the bam file with samtools view -H first.bam-second.bam I saw that I followed the same steps for each bam file. In the IGV results, I saw matches and alignment for all bam files.
Do you have any suggestions the solve this problem? What am I doing wrong?
featureCounts -T 8 -a gencode.v38.chr_patch_hapl_scaff.annotation.gtf -g 'transcript_id' -o readcouts.txt bam/*.bam
|| Total alignments : 11214480 || || Successfully assigned alignments : 4051945 (36.1%) || || Running time : 2.67 minutes
|| Total alignments : 0 || || Successfully assigned alignments : 0 || || Running time : 2.89 minutes
I also tried with Salmon in the salmon alignment-based quantification results bam file has huge differences between each other. salmon quant --ont -t reference.fa -l A -a first.bam -o salmon_quant1 Total # of mapped reads : 5465357
of uniquely mapped reads : 328808350000000
ambiguously mapped reads : 2177274 salmon quant --ont -t reference.fa -l A -a second.bam -o salmon_quant2 Completed first pass through the alignment file.
Total # of mapped reads : 3843632
of uniquely mapped reads : 2552463
ambiguously mapped reads : 1291169