tleonardi / nanocompore

RNA modifications detection from Nanopore dRNA-Seq data
https://nanocompore.rna.rocks
GNU General Public License v3.0
78 stars 12 forks source link

Error with parsing bed file #176

Closed nemitheasura closed 3 years ago

nemitheasura commented 3 years ago

Hi @tleonardi, I tried to obtain genomic positions in my report data table. I followed the instructions and passed .bed file as an argument, but unfortunately all I can achieve is the same error. I double-checked the headers. My .bed file was prepared by mapping transcriptome vs genome (minimap2; options: -ax splice -k14 -uf --secondary=no; samtools view with -F 2304 -b and sort; finally bedtools bamtobed -bed12).

I have no idea, what may be wrong with coordinates and how is this possible, that bed file does contain coordinates which are absent in fasta used to generate the bespoke bed file.

I paste the error below:

Traceback (most recent call last): File "", line 1, in File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 101, in init self.results = self.__calculate_results(adjust=True) File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 161, in __calculate_results df['genomicPos'] = df.apply(lambda row: bed_annot[row['ref_id']].tx2genome(coord=row['pos'], stranded=True),axis=1) File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/frame.py", line 7552, in apply return op.get_result() File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 185, in get_result return self.apply_standard() File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 276, in apply_standard results, res_index = self.apply_series_generator() File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/pandas/core/apply.py", line 305, in apply_series_generator results[i] = self.f(v) File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 161, in df['genomicPos'] = df.apply(lambda row: bed_annot[row['ref_id']].tx2genome(coord=row['pos'], stranded=True),axis=1) File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/bedparse/bedline.py", line 412, in tx2genome raise BEDexception("This coordinate doesn't exist in the transcript") bedparse.BEDexception: This coordinate doesn't exist in the transcript

Would you be so kind and help me to solve this issue? Best, Nemi

tleonardi commented 3 years ago

Hi @nemitheasura, as a BED file you need to pass the genomic annotation of the transcripts against which you mapped your reads. That's the only way for Nanocompore to be able to associate transcript coordinates (present in the BAM file) to genomic coordinates.

How did you generate the fasta file used for mapping with minimap2? If you produced from a BED file (e.g. Gencode/Ensembl), that's the file you need to pass. If instead you downloaded a transcriptome fasta directly, you need to figure out how it was generated and obtain the corresponding BED file.

nemitheasura commented 3 years ago

Hi @tleonardi, I tried various solutions:

1) I downloaded annotation file, transcriptome and corresponding genome from publicly available databases and passed them to nanocompore - the bed issue occurred;

2) I used the aforementioned files and according to #132 (samir-watson's response) edited headers manually to get rid of pipes (i replaced them with ::) - this also was not the correct solution;

3) I used downloaded transcriptome and genome files, mapped transcriptome to genome (as previously described), generated bam output, which I used to generate alternative bed file this way:

minimap2 -ax splice -k14 -uf --secondary=no genome.fa transcripts.fa | samtools view -F 2304 -b samtools sort -o t2g.bam samtools index t2g.bam bedtools bamtobed -bed12 -i t2g.bam > t2g.bed

4) I've read your answer, took already existing t2g.bed and genome.fa, then used your oneliner mentioned in #132 to generate new transcriptome.fa, which I used in new analysis instead of the first transcriptome reference. This also didn't resolve the problem, as it generated following error:

`File "/usr/local/software/Nanocompore/1.0.2/bin/nanocompore", line 11, in
2021-01-07T11:59:05.636492+0100 ERROR - MainProcess | Some references are missing from the BED file provided
2021-01-07T11:59:05.637260+0100 ERROR - MainProcess | An error occured. Killing all processes and closing queues

sys.exit(main())
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/main.py", line 175, in main
args.func(args)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/main.py", line 213, in sampcomp_main
db = s()
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampComp.py", line 245, in call
raise E
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampComp.py", line 233, in call
bed_fn=self.__bed_fn)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 101, in init
self.results = self.__calculate_results(adjust=True)
File "/usr/local/software/Nanocompore/1.0.2/lib/python3.6/site-packages/nanocompore/SampCompDB.py", line 159, in __calculate_results
raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided`

Finally - the same problem occurred, when i used dummy genome and dummy transcriptome (just one chromosome and corresponding transcripts): invalid coordinates or references missing from bed file.

I really don't know how to proceed with this (for some tasks I don't really need genomic coordinates, without bed file everything works perfectly fine; unfortunately there is one question I want to answer, which requires comparison of coordinates.

tleonardi commented 3 years ago

The problems is that bedtools getfasta appends the strand to the transcript name, so in the SAM/BAM alignments, the transcript name contain the strand info. Due to this nanocompore can't match transcript names between the BED and the data coming from the BAM/nanopolish. The quick solution is to convert transcript names in the BED file so that they also contain the strand: awk 'BEGIN{OFS=FS="\t"}{$4=$4"("$6")"; print}' reference_transcriptome.bed > reference_transcriptome_renamed.bed

sbridgett commented 1 year ago

(1) I also get this same exception: "nanocompore.common.NanocomporeError: Some references are missing from the BED file provided", when I run nanocompore sampcomp with the --bed parameter:

nanocompore sampcomp \
  --file_list1 fastq_Control_A_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_Control_B_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_Control_C_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
  --file_list2 fastq_MySample_A_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_MySample_B_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv,fastq_MySample_C_eventalign_collapsed_reads.tsv/out_eventalign_collapse.tsv \
  --label1 Control \
  --label2 MySample \
  --fasta gencode.v40.transcripts.fa \
  --bed gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed \
  --outpath sampcomp_results_for_Control_vs_MySample_gencode_v40_chr_patch_hapl_scaff_gtfbed_26Jan2023 \
  --progress

(2) The full output is:

2023-01-26T19:12:47.240780+0000 WARNING - MainProcess | Running SampComp
2023-01-26T19:12:47.241264+0000 INFO - MainProcess | Checking and initialising SampComp
2023-01-26T19:12:50.152208+0000 INFO - MainProcess | Reading eventalign index files
2023-01-26T19:15:22.245903+0000 INFO - MainProcess |    References found in index: 39554
2023-01-26T19:15:22.246506+0000 INFO - MainProcess | Filtering out references with low coverage
2023-01-26T19:15:28.364108+0000 INFO - MainProcess |    References remaining after reference coverage filtering: 1881
2023-01-26T19:15:30.240352+0000 INFO - MainProcess | Starting data processing
100%|██████████| 1881/1881 [66:44:56<00:00, 127.75s/ Processed References]2023-01-29T14:00:27.642460+0000 INFO - Process-3 | All Done. Transcripts processed: 1881

2023-01-29T14:00:28.822045+0000 INFO - MainProcess | Loading SampCompDB
2023-01-29T14:00:29.553411+0000 INFO - MainProcess | Calculate results
Traceback (most recent call last):
  File "/home/myusername/miniconda3/envs/nanocompore/bin/nanocompore", line 10, in <module>
2023-01-29T15:24:28.038506+0000 ERROR - MainProcess | Some references are missing from the BED file provided
2023-01-29T15:24:28.039087+0000 ERROR - MainProcess | An error occured. Killing all processes and closing queues

    sys.exit(main())
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 175, in main
    args.func(args)
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/__main__.py", line 213, in sampcomp_main
    db = s()
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampComp.py", line 251, in __call__
    raise E
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampComp.py", line 239, in __call__
    bed_fn=self.__bed_fn)
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 101, in __init__
    self.results = self.__calculate_results(adjust=True)
  File "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py", line 159, in __calculate_results
    raise NanocomporeError("Some references are missing from the BED file provided")
nanocompore.common.NanocomporeError: Some references are missing from the BED file provided

(3) Running this "nanocompore sampcomp" without the "--bed" parameter, finished okay, and the "saved_results_report.tsv" contains:

pos     chr     genomicPos      ref_id  strand  ref_kmer        GMM_logit_pvalue        KS_dwell_pvalue KS_intensity_pvalue     GMM_cov_type    GMM_n_clust     cluster_counts  Logit_LOR
12      NA      NA      ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|      NA      GCAGC   0.982701294413789       0.9999999989119721      0.876167596188685  >
13      NA      NA      ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|      NA      CAGCG   0.9597669165883118      0.9999999989119721      0.20809078249225463>
14      NA      NA      ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|
etc...

(4) For the transcripts input to nanocompore, I used: gencode.v40.transcripts.fa (from: https://www.gencodegenes.org/human/release_40.html ("Transcript sequences | CHR | Nucleotide sequences of all transcripts on the reference chromosomes")

(5) For the gtf file I used: "gencode.v40.chr_patch_hapl_scaff.annotation.gtf" (also from: (from: https://www.gencodegenes.org/human/release_40.html ("Comprehensive gene annotation | It contains the comprehensive gene annotation on the primary assembly (chromosomes and scaffolds) sequence regions. This is a superset of the main annotation file"), which should contain all the transcripts.

(6) I used gffutils 'convert2bed' to convert the gtf to bed format:

pip install gffutils

convert2bed --input=gtf --output=bed  < gencode.v40.chr_patch_hapl_scaff.annotation.gtf > gencode.v40.primary_assembly.annotation.converted_by_gtf2bed.bed

The resulting bed file: "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed" contains:

> GL000009.2      56139   58083   ENSG00000278704.1       .       -       ENSEMBL UTR     .       gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> GL000009.2      56139   58376   ENSG00000278704.1       .       -       ENSEMBL exon    .       gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> GL000009.2      56139   58376   ENSG00000278704.1       .       -       ENSEMBL gene    .       gene_id "ENSG00000278704.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; level 3;
> GL000009.2      56139   58376   ENSG00000278704.1       .       -       ENSEMBL transcript      .       gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript_name "ENST00000618686"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
> etc...

(7) I've compared the "saved_results_report.tsv" (from running without the --bed parameter) with this "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed". All the gene_id's and transcript_ids in the saved_results_report.tsv are in "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed", so I'm not sure why I get the above error.

Is there a way that nanocompore sampcomp could print the missing gene or transcript ids, rather than exiting?

Thank you.

sbridgett commented 1 year ago

In that "/home/myusername/miniconda3/envs/nanocompore/lib/python3.7/site-packages/nanocompore/SampCompDB.py" at that line 159 where the exception occurs, I've added three lines to try to find the references that are missing. Hopefully this is correct.

           except:
                raise NanocomporeError("Can't open BED file")
            if len(bed_annot) != len(self.ref_id_list):
                # These three lines were added by me:
                for record_name in self.ref_id_list:
                     if record_name not in bed_annot:
                          print("\nERROR: '%s' NOT in bed_annot\n" %(record_name))

                raise NanocomporeError("Some references are missing from the BED file provided: len(bed_annot)=%d  !=  len(self.ref_id_list)=%d" %(len(bed_annot), len(self.ref_id_list))
sbridgett commented 1 year ago

I think the cause is that the ref_ids in my SampCompDB are in the format:

"ENST00000649529.1|ENSG00000187608.10|OTTHUMG00000040777.4|OTTHUMT00000501480.1|ISG15-203|ISG15|637|protein_coding|"

(which are from the input GenCode transcripts file: gencode.v40.transcripts.fa, which has format:

>ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
GTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTC
TCTTAGCCCAGACTTCCCGTGTCCTTTCCACCGGGCCTTTGAGAGGTCACAGGGTCTTGA
TGCTGTGGTCTTCATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTG
CAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGG
etc..

Whereas the bed file "gencode.v40.chr_patch_hapl_scaff.annotation.converted_by_gtf2bed.bed", (from convert2bed of the gencode gtf file), has:

eg: "GL000009.2" or "ENSG00000278704.1" as the id:

GL000009.2      56139   58083   ENSG00000278704.1       .       -       ENSEMBL UTR     .       gene_id "ENSG00000278704.1"; transcript_id "ENST00000618686.1"; gene_type "protein_coding"; gene_name "ENSG00000278704"; transcript_type "protein_coding"; transcript
_name "ENST00000618686"; exon_number 1; exon_id "ENSE00003753029.1"; level 3; protein_id "ENSP00000484918.1"; transcript_support_level "NA"; tag "basic"; tag "Ensembl_canonical";
etc....

So I need to get the bed file to use the same transcript ids as in the "gencode.v40.transcripts.fa", but because of splicing, then each transcript in the "gencode.v40.transcripts.fa" uses several exon lines in the GTF file, so the BED file (from the GTF) won't have unique ids for each line. So I'm not sure how to format the bed file to input to "nanocompore sampcomp".

chr1    1013496 1013576 ENSG00000187608.10      .       +       HAVANA  exon    .       gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";

chr1    1013496 1014540 ENSG00000187608.10      .       +       HAVANA  transcript      .       gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";

chr1    1013573 1013576 ENSG00000187608.10      .       +       HAVANA  CDS     0       gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";

chr1    1013573 1013576 ENSG00000187608.10      .       +       HAVANA  start_codon     0       gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 1; exon_id "ENSE00003831768.1"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";

chr1    1013983 1014004 ENSG00000187608.10      .       +       HAVANA  UTR     .       gene_id "ENSG00000187608.10"; transcript_id "ENST00000624652.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-201"; exon_number 3; exon_id "ENSE00003756535.1"; level 2; protein_id "ENSP00000485313.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479385.2";

chr1    1013983 1014004 ENSG00000187608.10      .       +       HAVANA  UTR     .       gene_id "ENSG00000187608.10"; transcript_id "ENST00000624697.4"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-202"; exon_number 3; exon_id "ENSE00003760003.2"; level 2; protein_id "ENSP00000485643.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "basic"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479384.2";

chr1    1013983 1014435 ENSG00000187608.10      .       +       HAVANA  exon    .       gene_id "ENSG00000187608.10"; transcript_id "ENST00000624652.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-201"; exon_number 3; exon_id "ENSE00003756535.1"; level 2; protein_id "ENSP00000485313.1"; transcript_support_level "3"; hgnc_id "HGNC:4053"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000479385.2";

chr1    1013983 1014475 ENSG00000187608.10      .       +       HAVANA  CDS     0       gene_id "ENSG00000187608.10"; transcript_id "ENST00000649529.1"; gene_type "protein_coding"; gene_name "ISG15"; transcript_type "protein_coding"; transcript_name "ISG15-203"; exon_number 2; exon_id "ENSE00001480799.4"; level 2; protein_id "ENSP00000496832.1"; hgnc_id "HGNC:4053"; tag "CAGE_supported_TSS"; tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS6.1"; havana_gene "OTTHUMG00000040777.4"; havana_transcript "OTTHUMT00000501480.1";

etc....
sbridgett commented 1 year ago

I've found a similar question about converting gtf to bed using bedparse, so I'll try that method: https://github.com/tleonardi/nanocompore/issues/132