hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
187 stars 58 forks source link

cider - Invalid reference index - htsjdk.samtools.QueryInterval #376

Closed 3d7l closed 1 year ago

3d7l commented 1 year ago

Hello,

I face the error below when running cider on rna-seq data:

09:21:38.010 INFO  [main] - found constant region gene: IGHD5OR15-5B, type: IGHD, location: chr15:21010494-21010516(-)
09:21:38.010 INFO  [main] - found constant region gene: IGHD4OR15-4B, type: IGHD, location: chr15:21011451-21011469(-)
09:21:38.010 INFO  [main] - found constant region gene: IGHD3OR15-3B, type: IGHD, location: chr15:21012559-21012589(-)
09:21:38.010 INFO  [main] - found constant region gene: IGHD2OR15-2B, type: IGHD, location: chr15:21015048-21015078(-)
09:21:38.010 INFO  [main] - found constant region gene: IGHD1OR15-1B, type: IGHD, location: chr15:21017800-21017816(-)
09:21:38.079 INFO  [main] - found 372 gene locations
09:21:38.089 DEBUG [main] - Processing 432 potential sites in bam /home/jerome/crc/tmp/test_cider/output01Aligned.out.bam
09:21:38.090 INFO  [main] - 2 bam record consumer threads started
09:21:38.091 DEBUG [worker-0] - bam record consumer thread start
09:21:38.091 DEBUG [worker-1] - bam record consumer thread start
09:21:38.104 DEBUG [main] - bam reader start
09:21:38.105 DEBUG [main] - querying genome region: GenomeRegionImpl{chromosome=chr2, start=89884164, end=89886193}
09:21:38.106 ERROR [main] - [Thread[main,5,main]]: uncaught exception: java.lang.IllegalArgumentException: Invalid reference index -1
java.lang.IllegalArgumentException: Invalid reference index -1
        at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
        at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
        at com.hartwig.hmftools.cider.AsyncBamReader$BamReader.run(AsyncBamReader.kt:76)
        at com.hartwig.hmftools.cider.AsyncBamReader.processBam(AsyncBamReader.kt:50)
        at com.hartwig.hmftools.cider.CiderApplication.readBamFile(CiderApplication.kt:142)
        at com.hartwig.hmftools.cider.CiderApplication.run(CiderApplication.kt:74)
        at com.hartwig.hmftools.cider.CiderApplication$Companion.main(CiderApplication.kt:246)
        at com.hartwig.hmftools.cider.CiderApplication.main(CiderApplication.kt)

Any idea ?

I used STAR (version 2.7.10b) to index the reference genome and to create the BAM file from the rna-seq FASTA file.

I used "ensembl_gene_data.csv" v38.

I used the following reference genome: https://ftp.ncbi.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.26_GRCh38

hongwingl commented 1 year ago

Looks like it cannot find the chromosome chr2 in the bam header. Could you show me what the bam header looks like with: samtools view -H /home/jerome/crc/tmp/test_cider/output01Aligned.out.bam Thanks.

3d7l commented 1 year ago

bam_header.txt

hongwingl commented 1 year ago

thanks. Cider requires the following chromosome names chr1-chr22 and chrX, chrY.

@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415

I will make cider output clearer when it cannot find the chromosome in the bam file rather than just crashing.

hongwingl commented 1 year ago

You can use our reference genome if that helps. It is here: HMFtools-Resources/ref_genome/38

3d7l commented 1 year ago

Indeed, that was the problem !

Chromosome notation differs depending on the reference genome provider ucsc => chr1 ensembl => 1 ncbi => NC_000001.11

Thanks for your help.