Closed nemitheasura closed 3 years ago
Hi @nemitheasura, as a BED file you need to pass the genomic annotation of the transcripts against which you mapped your reads. That's the only way for Nanocompore to be able to associate transcript coordinates (present in the BAM file) to genomic coordinates.
How did you generate the fasta file used for mapping with minimap2? If you produced from a BED file (e.g. Gencode/Ensembl), that's the file you need to pass. If instead you downloaded a transcriptome fasta directly, you need to figure out how it was generated and obtain the corresponding BED file.
Hi @tleonardi, I tried various solutions:
1) I downloaded annotation file, transcriptome and corresponding genome from publicly available databases and passed them to nanocompore - the bed issue occurred;
2) I used the aforementioned files and according to #132 (samir-watson's response) edited headers manually to get rid of pipes (i replaced them with ::) - this also was not the correct solution;
3) I used downloaded transcriptome and genome files, mapped transcriptome to genome (as previously described), generated bam output, which I used to generate alternative bed file this way:
minimap2 -ax splice -k14 -uf --secondary=no genome.fa transcripts.fa | samtools view -F 2304 -b samtools sort -o t2g.bam samtools index t2g.bam bedtools bamtobed -bed12 -i t2g.bam > t2g.bed
4) I've read your answer, took already existing t2g.bed and genome.fa, then used your oneliner mentioned in #132 to generate new transcriptome.fa, which I used in new analysis instead of the first transcriptome reference. This also didn't resolve the problem, as it generated following error:
`File "/usr/local/software/Nanocompore/1.0.2/bin/nanocompore", line 11, in
2021-01-07T11:59:05.636492+0100 ERROR - MainProcess | Some references are missing from the BED file provided
2021-01-07T11:59:05.637260+0100 ERROR - MainProcess | An error occured. Killing all processes and closing queues
sys.exit(main())
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/main.py", line 175, in main
args.func(args)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/main.py", line 213, in sampcomp_main
db = s()
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampComp.py", line 245, in call
raise E
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampComp.py", line 233, in call
bed_fn=self.__bed_fn)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 101, in init
self.results = self.__calculate_results(adjust=True)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 159, in __calculate_results
raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided`
Finally - the same problem occurred, when i used dummy genome and dummy transcriptome (just one chromosome and corresponding transcripts): invalid coordinates or references missing from bed file.
I really don't know how to proceed with this (for some tasks I don't really need genomic coordinates, without bed file everything works perfectly fine; unfortunately there is one question I want to answer, which requires comparison of coordinates.
The problems is that bedtools getfasta appends the strand to the transcript name, so in the SAM/BAM alignments, the transcript name contain the strand info. Due to this nanocompore can't match transcript names between the BED and the data coming from the BAM/nanopolish.
The quick solution is to convert transcript names in the BED file so that they also contain the strand:
awk 'BEGIN{OFS=FS="\t"}{$4=$4"("$6")"; print}' reference_transcriptome.bed > reference_transcriptome_renamed.bed
(1) I also get this same exception: "nanocompore.common.NanocomporeError: Some references are missing from the BED file provided", when I run nanocompore sampcomp with the --bed parameter:
nanocompore sampcomp \
--file_list1 fastq_Control_A_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_Control_B_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_Control_C_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
--file_list2 fastq_MySample_A_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_MySample_B_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_MySample_C_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
--label1 Control \
--label2 MySample \
--fasta gencode.v40.transcripts.fa \
--bed gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed \
--outpath sampcomp_results_for_Control_vs_MySample_gencode_v40_chr_patch_hapl_scaff_gtfbed_26Jan2023 \
--progress
(2) The full output is:
2023-01-26T19:12:47.240780+0000 WARNING - MainProcess | Running SampComp
2023-01-26T19:12:47.241264+0000 INFO - MainProcess | Checking and initialising SampComp
2023-01-26T19:12:50.152208+0000 INFO - MainProcess | Reading eventalign index files
2023-01-26T19:15:22.245903+0000 INFO - MainProcess | References found in index: 39554
2023-01-26T19:15:22.246506+0000 INFO - MainProcess | Filtering out references with low coverage
2023-01-26T19:15:28.364108+0000 INFO - MainProcess | References remaining after reference coverage filtering: 1881
2023-01-26T19:15:30.240352+0000 INFO - MainProcess | Starting data processing
100%|██████████| 1881/1881 [66:44:56<00:00, 127.75s/ Processed References]2023-01-29T14:00:27.642460+0000 INFO - Process-3 | All Done. Transcripts processed: 1881
2023-01-29T14:00:28.822045+0000 INFO - MainProcess | Loading SampCompDB
2023-01-29T14:00:29.553411+0000 INFO - MainProcess | Calculate results
Traceback (most recent call last):
File "/home/myusername/miniconda3/envs/nanocompore/bin/nanocompore", line 10, in <module>
2023-01-29T15:24:28.038506+0000 ERROR - MainProcess | Some references are missing from the BED file provided
2023-01-29T15:24:28.039087+0000 ERROR - MainProcess | An error occured. Killing all processes and closing queues
sys.exit(main())
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 175, in main
args.func(args)
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 213, in sampcomp_main
db = s()
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampComp.py", line 251, in __call__
raise E
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampComp.py", line 239, in __call__
bed_fn=self.__bed_fn)
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 101, in __init__
self.results = self.__calculate_results(adjust=True)
File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 159, in __calculate_results
raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided
(3) Running this "nanocompore sampcomp" without the "--bed" parameter, finished okay, and the "saved_results_report.tsv" contains:
pos chr genomicPos ref_id strand ref_kmer GMM_logit_pvalue KS_dwell_pvalue KS_intensity_pvalue GMM_cov_type GMM_n_clust cluster_counts Logit_LOR
12 NA NA ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding| NA GCAGC 0.982701294413789 0.9999999989119721 0.876167596188685 >
13 NA NA ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding| NA CAGCG 0.9597669165883118 0.9999999989119721 0.20809078249225463>
14 NA NA ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|
etc...
(4) For the transcripts input to nanocompore, I used: gencode.v40.transcripts.fa (from: https://www.gencodegenes.org/human/release_40.html ("Transcript sequences | CHR | Nucleotide sequences of all transcripts on the reference chromosomes")
(5) For the gtf file I used: "gencode.v40.chr_patch_hapl_scaff.annotation.gtf" (also from: (from: https://www.gencodegenes.org/human/release_40.html ("Comprehensive gene annotation | It contains the comprehensive gene annotation on the primary assembly (chromosomes and scaffolds) sequence regions. This is a superset of the main annotation file"), which should contain all the transcripts.
(6) I used gffutils 'convert2bed' to convert the gtf to bed format:
pip install gffutils
convert2bed --input=gtf --output=bed < gencode.v40.chr_patch_hapl_scaff.annotation.gtf > gencode.v40.primary_assembly.annotation.converted_by_gtf2bed.bed
The resulting bed file: "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed" contains:
> GL000009.2 56139 58083 ENSG00000278704.1 . - ENSEMBL UTR . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> GL000009.2 56139 58376 ENSG00000278704.1 . - ENSEMBL exon . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> GL000009.2 56139 58376 ENSG00000278704.1 . - ENSEMBL gene . gene_id "ENSG00000278704.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; level 3;
> GL000009.2 56139 58376 ENSG00000278704.1 . - ENSEMBL transcript . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> etc...
(7) I've compared the "saved_results_report.tsv" (from running without the --bed parameter) with this "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed". All the gene_id's and transcript_ids in the saved_results_report.tsv are in "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed", so I'm not sure why I get the above error.
Is there a way that nanocompore sampcomp could print the missing gene or transcript ids, rather than exiting?
Thank you.
In that "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py" at that line 159 where the exception occurs, I've added three lines to try to find the references that are missing. Hopefully this is correct.
except:
raise NanocomporeError("Can't open BED file")
if len(bed_annot) != len(self.ref_id_list):
# These three lines were added by me:
for record_name in self.ref_id_list:
if record_name not in bed_annot:
print("\nERROR: '%s' NOT in bed_annot\n" %(record_name))
raise NanocomporeError("Some references are missing from the BED file provided: len(bed_annot)=%d != len(self.ref_id_list)=%d" %(len(bed_annot), len(self.ref_id_list))
I think the cause is that the ref_ids in my SampCompDB are in the format:
"ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|"
(which are from the input GenCode transcripts file: gencode.v40.transcripts.fa, which has format:
>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
etc..
Whereas the bed file "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed", (from convert2bed of the gencode gtf file), has:
eg: "GL000009.2" or "ENSG00000278704.1" as the id:
GL000009.2 56139 58083 ENSG00000278704.1 . - ENSEMBL UTR . gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript
_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
etc....
So I need to get the bed file to use the same transcript ids as in the "gencode.v40.transcripts.fa", but because of splicing, then each transcript in the "gencode.v40.transcripts.fa" uses several exon lines in the GTF file, so the BED file (from the GTF) won't have unique ids for each line. So I'm not sure how to format the bed file to input to "nanocompore sampcomp".
chr1 1013496 1013576 ENSG00000187608.10 . + HAVANA exon . gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";
chr1 1013496 1014540 ENSG00000187608.10 . + HAVANA transcript . gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";
chr1 1013573 1013576 ENSG00000187608.10 . + HAVANA CDS 0 gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";
chr1 1013573 1013576 ENSG00000187608.10 . + HAVANA start_codon 0 gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";
chr1 1013983 1014004 ENSG00000187608.10 . + HAVANA UTR . gene_id "ENSG00000187608.10"; transcript_id "ENST00000624652.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-201"; exon_number 3; exon_id "ENSE00003756535.1"; level 2; protein_id "ENSP00000485313.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479385.2";
chr1 1013983 1014004 ENSG00000187608.10 . + HAVANA UTR . gene_id "ENSG00000187608.10"; transcript_id "ENST00000624697.4"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-202"; exon_number 3; exon_id "ENSE00003760003.2"; level 2; protein_id "ENSP00000485643.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "basic"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479384.2";
chr1 1013983 1014435 ENSG00000187608.10 . + HAVANA exon . gene_id "ENSG00000187608.10"; transcript_id "ENST00000624652.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-201"; exon_number 3; exon_id "ENSE00003756535.1"; level 2; protein_id "ENSP00000485313.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479385.2";
chr1 1013983 1014475 ENSG00000187608.10 . + HAVANA CDS 0 gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 2; exon_id "ENSE00001480799.4"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";
etc....
I've found a similar question about converting gtf to bed using bedparse, so I'll try that method: https://github.com/tleonardi/nanocompore/issues/132
Hi @tleonardi, I tried to obtain genomic positions in my report data table. I followed the instructions and passed .bed file as an argument, but unfortunately all I can achieve is the same error. I double-checked the headers. My .bed file was prepared by mapping transcriptome vs genome (minimap2; options: -ax splice -k14 -uf --secondary=no; samtools view with -F 2304 -b and sort; finally bedtools bamtobed -bed12).
I have no idea, what may be wrong with coordinates and how is this possible, that bed file does contain coordinates which are absent in fasta used to generate the bespoke bed file.
I paste the error below:
Traceback (most recent call last): File "", line 1, in
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 101, in init
self.results = self.__calculate_results(adjust=True)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 161, in __calculate_results
df['genomicPos'] = df.apply(lambda row: bed_annot[row['ref_id']].tx2genome(coord=row['pos'], stranded=True),axis=1)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/frame.py", line 7552, in apply
return op.get_result()
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 185, in get_result
return self.apply_standard()
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 276, in apply_standard
results, res_index = self.apply_series_generator()
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 305, in apply_series_generator
results[i] = self.f(v)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 161, in
df['genomicPos'] = df.apply(lambda row: bed_annot[row['ref_id']].tx2genome(coord=row['pos'], stranded=True),axis=1)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/bedparse/bedline.py", line 412, in tx2genome
raise BEDexception("This coordinate doesn't exist in the transcript")
bedparse.BEDexception: This coordinate doesn't exist in the transcript
Would you be so kind and help me to solve this issue? Best, Nemi