lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
482 stars 133 forks source link

biostar103303 error: no exon found #117

Closed giuseppedelnapalle closed 5 years ago

giuseppedelnapalle commented 5 years ago

Hi,

I had a problem when running the biostar103303 program. The process failed giving the message "no exon found".

The command was:

ls *bam | java -jar biostar103303.jar -g Homo_sapiens.GRCh38.93.gtf > result.tsv

More error messages were shown below.

... [WARN][Biostar103303]chromosome in KI270721.1 ensembl exon 2585 2692 .

  • . gene_id "ENSG00000276345"; gene_version "1"; transcript_id "ENST00000612848"; transcript_version "1"; exon_number "1"; gene_name "AC004556.1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "AC004556.1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSE00003754502"; exon_version "1"; tag "basic"; transcript_support_level "1"; not in SAMSequenceDictionary [WARN][Biostar103303]chromosome in KI270726.1 ensembl exon 26241 26534 .
  • . gene_id "ENSG00000277856"; gene_version "1"; transcript_id "ENST00000619729"; transcript_version "1"; exon_number "1"; gene_name "AC233755.2"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "AC233755.2-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSE00003738911"; exon_version "1"; tag "basic"; transcript_support_level "NA"; not in SAMSequenceDictionary [WARN][Biostar103303]chromosome in KI270711.1 ensembl exon 24492 24650 .
  • . gene_id "ENSG00000271254"; gene_version "6"; transcript_id "ENST00000614336"; transcript_version "4"; exon_number "1"; gene_name "AC240274.1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "AC240274.1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSE00003720748"; exon_version "1"; tag "basic"; transcript_support_level "1"; not in SAMSequenceDictionary [WARN][Biostar103303]chromosome in KI270713.1 ensembl exon 21861 22024 .
  • . gene_id "ENSG00000275405"; gene_version "1"; transcript_id "ENST00000619109"; transcript_version "1"; exon_number "1"; gene_name "RF00003"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "RF00003.35-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; exon_id "ENSE00003747303"; exon_version "1"; tag "basic"; transcript_support_level "NA"; not in SAMSequenceDictionary [INFO][Biostar103303]End Reading /path/to/Homo_sapiens.GRCh38.93.gtf N=0 [SEVERE][Biostar103303]no exon found [INFO][Launcher]biostar103303 Exited with failure (-1)

The bam file was created by STAR, and the parameters for STAR were:

STAR --runThreadN 8 \ --genomeDir ${STARindex_dir} \ --outSAMtype BAM SortedByCoordinate \ --readFilesIn $read_1 $read_2 \ --readFilesCommand zcat \ --outFileNamePrefix sample_1 \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 20 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverReadLmax 0.05 \ --alignIntronMin 10 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 59999999999

Additionally, the gtf file was downloaded here (ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.gtf.gz).

Your environment

Do you have any idea what went wrong?

Thank you.

lindenb commented 5 years ago

Hi, I haven't used this software since I wrote it for biostars. I'm going to have a look....

lindenb commented 5 years ago

I think your problem is that your bam uses a chromosome notation '1','2',... that is not the same as your gtf file 'chr1','chr2',... I added a method to automatically convert the chromosomes' names.

giuseppedelnapalle commented 5 years ago

I think your problem is that your bam uses a chromosome notation '1','2',... that is not the same as your gtf file 'chr1','chr2',... I added a method to automatically convert the chromosomes' names.

The annotation of chromosome of the gtf file is not the cause. I used the same gtf file from Ensembl for building STAR index and biostar103303.jar, so inconsistency of annotation should not be the case. I checked the format of sam (created with samtools from bam), and it proved the speculation. The chromosome of the sam is annotated as 1, 2, 6 etc. Here is an example of the sam I checked.

XXX9743853.50878191 419 1 11594 1 100M = 11643 148 CTGTATCCCACCAGCAATGTCTAGGAATACCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATBBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIIIFIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFFFFBFBFFFBFFF NH:i:3 HI:i:2 AS:i:195 nM:i:1

lindenb commented 5 years ago

but on my side, I'm finding some exons with a random bam and the specified gtf ? (using http instead of ftp )

 java -jar dist/biostar103303.jar -g 'http://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.Gh38.93.gtf.gz' src/test/resources/HG02260.transloc.chr9.14.bam 2>&1 | grep -v "Cannot find contig" | head

[INFO][Biostar103303]Reading http://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.gtf.gz
[INFO][Biostar103303]End Reading http://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens/Homo_sapiens.GRCh38.93.gtf.gz N=1237092
[INFO][Biostar103303]. Completed. N=477. That took:0 second
#chrom  exon.start  exon.end    exon.exon_id    exon.index5_3   transcript_id   gene_name   gene_id exon.count_prev_and_next    exon.count_prev_and_curr    exon.count_curr_and_next    exon.count_curr_only    exon.count_others
22  10736171    10736283    ENSE00003736336 1/1 ENST00000615943 RF00004 ENSG00000277248 0   0   0   0   0
22  10939388    10939423    ENSE00003790077 1/9 ENST00000635667 FRG1FP  ENSG00000283047 0   0   0   0   0
22  10940597    10940707    ENSE00003791492 2/9 ENST00000635667 FRG1FP  ENSG00000283047 0   0   0   0   0
22  10941691    10941780    ENSE00003787209 3/9 ENST00000635667 FRG1FP  ENSG00000283047 0   0   0   0   0
22  10944967    10945053    ENSE00003785466 4/9 ENST00000635667 FRG1FP  ENSG00000283047 0   0   0   0   0
giuseppedelnapalle commented 5 years ago

There seems to be something wrong with my biostar103303.jar software. I had an error when testing the example you gave (last command in the document of biostar103303). The full messages were shown below.

[giuseppe@localhost dist]$ curl -s "http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCshlLongRnaSeq/wgEncodeCshlLongRnaSeqA549CellLongnonpolyaAlnRep1.bam" | java -jar biostar103303.jar -g "http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz" > result.tsv [INFO][Biostar103303]Reading sfomr stdin [INFO][Biostar103303]Reading http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz [SEVERE][Biostar103303]http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz java.io.FileNotFoundException: http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1890) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492) at java.net.URL.openStream(URL.java:1045) at com.github.lindenb.jvarkit.io.IOUtils.openURIForReading(IOUtils.java:272) at com.github.lindenb.jvarkit.io.IOUtils.openURIForLineReader(IOUtils.java:400) at com.github.lindenb.jvarkit.io.IOUtils.openURIForLineIterator(IOUtils.java:405) at com.github.lindenb.jvarkit.tools.biostar.Biostar103303.readGTF(Biostar103303.java:169) at com.github.lindenb.jvarkit.tools.biostar.Biostar103303.doWork(Biostar103303.java:328) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1208) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1366) at com.github.lindenb.jvarkit.tools.biostar.Biostar103303.main(Biostar103303.java:514) [INFO][Launcher]biostar103303 Exited with failure (-1)

Environment: version of jvarkit: jvarkit 2018.04.05 version of java: java version "1.8.0_192" the value of ${JAVA_HOME}: /usr/java/jdk1.8.0_192-amd64 OS: CentOS 7

Would you please offer some advice?

Cheers.

lindenb commented 5 years ago

the url for your GTF is wrong.

$ wget -O - "http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz" > /dev/null 
--2018-11-29 14:11:15--  http://atgu.mgh.harvard.edu/plinkseq/dist/aux/gencodeBasicV11-hg19.gtf.gz
(...)
Proxy request sent, awaiting response... 404 Not Found
2018-11-29 14:11:16 ERROR 404: Not Found.
giuseppedelnapalle commented 5 years ago

OK, the url in the example is not valid. I think the problem was caused by something relevant to java. When I ran the command

java -jar biostar103303.jar path/to/sample.bam -g path/to/Homo_sapiens.GRCh38.93.gtf > result.tsv

The following message indicating illegal number of arguments was returned:

[SEVERE][Biostar103303]Illegal number of arguments. [INFO][Launcher]biostar103303 Exited with failure (-1)

Any suggestion on this issue?

lindenb commented 5 years ago

this was fixed this morning when I've updated the code, please pull the new code.

Otherwise try

cat   path/to/sample.bam | java -jar biostar103303.jar  -g path/to/Homo_sapiens.GRCh38.93.gtf > result.tsv
giuseppedelnapalle commented 5 years ago

Thank you, the program seems to be running well. I'll see if it can output reasonable results.