matthewwiese / apple-phg

A Practical Haplotype Graph (PHG) for apple (Malus domestica)
https://haplotype.net/apple-phg

Genic regions before merge is 0 #1

Closed matthewwiese closed 1 year ago

matthewwiese commented 1 year ago

CreateRefRangesPlugin reports 52039 genic regions, but for some reason 0 of them remain before the merge step:

Number of Genic Regions: 52039
Number of genicRegions Before Merge: 0

The suspect code is a Kotlin lambda in CreateRefRangeUtils.kt. I am not sure whether this is caused by the assemblies sourced from NCBI, my preprocessing steps, or a bug in the PHG.

The full log is below. The lambda mentioned above produces an empty list, which triggers the error "Empty list doesn't contain element at index 0" on line 189.

[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.89  Date: May 19, 2023
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 262144 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 17.0.3-internal
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
[main] INFO net.maizegenetics.tassel.TasselLogging - Number of Processors: 64
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Citation: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
[main] INFO net.maizegenetics.tassel.TasselLogging -
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Using Library: Practical Haplotype Graph (PHG): Version: 1.6 Date: July 11, 2023
[main] INFO net.maizegenetics.tassel.TasselLogging - PHG Citation: Bradbury PJ, Casstevens T, Jensen SE, Johnson LC, Miller ZR, Monier B, Romay MC, Song B, Buckler ES. The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 2022 Aug 2;38(15):3698-3702. doi: 10.1093/bioinformatics/btac410. PMID: 35748708; PMCID: PMC9344836.
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -CreateRefRangesPlugin, -wiggleDir, ./wiggle/coverage, -gffFile, ./data/reference/genomic_strand_fixed.gff, -minCover, 7, -outputBedFile, ./refRanges.bed, -refGenome, ./data/reference/GCF_002114115.1_ASM211411v1_genomic.fna, -vcfdir, ./gvcf, -outputGeneRanges, ./geneRanges.bed, -nThreads, 62, -endPlugin, -runfork1]
net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin: time: Sep 26, 2023 9:53:5
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
CreateRefRangesPlugin Parameters
wiggleDir: ./wiggle/coverage
secondaryWiggleDir: null
gffFile: ./data/reference/genomic_strand_fixed.gff
gffFeatureType: CDS
minCover: 7
secondaryMinCover: -1
windowSize: 10
intergenicStepSize: 50000
maxSearchWindow: 10000
outputBedFile: ./refRanges.bed
refGenome: ./data/reference/GCF_002114115.1_ASM211411v1_genomic.fna
vcfdir: ./gvcf
outputGeneRanges: ./geneRanges.bed
useSecondaryForIntergenic: false
mxDiv: 1.0E-4
minLength: 1000
maxClusters: 10
nThreads: 62

[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - Loaded first batch of wiggle files.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 10 41841605 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 11 42925075 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 12 33134071 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 13 44437459 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 14 32560231 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 15 55080361 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 16 41441581 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 17 34817048 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 1 32709648 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 2 37631755 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 3 37690471 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 4 32357154 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 5 48068851 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 6 37231166 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 7 36738692 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 8 31666303 null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin - 9 37676754 null
[pool-1-thread-1] INFO org.biojava.nbio.genome.parsers.gff.GFF3Reader - Reading: ./data/reference/genomic_strand_fixed.gff
Number of Genic Regions: 52039
Number of genicRegions Before Merge: 0
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Empty list doesn't contain element at index 0.
java.lang.IndexOutOfBoundsException: Empty list doesn't contain element at index 0.
        at kotlin.collections.EmptyList.get(Collections.kt:36)
        at kotlin.collections.EmptyList.get(Collections.kt:24)
        at net.maizegenetics.pangenome.pipeline.CreateRefRangeUtils.mergeGenicRegions(CreateRefRangeUtils.kt:236)
        at net.maizegenetics.pangenome.pipeline.CreateRefRangeUtils.createGenicRegions(CreateRefRangeUtils.kt:198)
        at net.maizegenetics.pangenome.pipeline.CreateRefRangesPlugin.processData(CreateRefRangesPlugin.kt:136)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:2017)
        at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

matthewwiese commented 1 year ago

This was caused by the chromosome names in the reference GFF retaining their original names (e.g. NC_041789.1 instead of chr1). Fixed in this commit: https://github.com/matthewwiese/apple-phg/commit/a8e059ec78ae3cff3f4c883f5d504cadcf58f1ab
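
For anyone hitting the same error, here is a minimal sketch of the kind of renaming involved. It is not the code from the commit above. It assumes a tab-separated GFF3 file whose first column is the sequence name; the function name `rename_gff_seqids`, the contents of `name_map`, and the sample feature line are all hypothetical and only for illustration.

```python
def rename_gff_seqids(lines, name_map):
    """Yield GFF3 lines with the sequence name (column 1) renamed via name_map.

    Names missing from name_map are passed through unchanged.
    """
    for line in lines:
        if line.startswith("##sequence-region"):
            # Directive form: "##sequence-region <seqid> <start> <end>"
            parts = line.split()
            if len(parts) >= 2 and parts[1] in name_map:
                parts[1] = name_map[parts[1]]
                line = " ".join(parts) + "\n"
            yield line
        elif line.startswith("#") or not line.strip():
            # Other directives, comments, and blank lines are left alone.
            yield line
        else:
            seqid, sep, rest = line.partition("\t")
            yield name_map.get(seqid, seqid) + sep + rest


# Hypothetical mapping and made-up records, for illustration only.
name_map = {"NC_041789.1": "chr1"}
gff = [
    "##gff-version 3\n",
    "##sequence-region NC_041789.1 1 1000\n",
    "NC_041789.1\tGnomon\tCDS\t100\t200\t.\t+\t0\tID=cds1\n",
]
renamed = list(rename_gff_seqids(gff, name_map))
print("".join(renamed), end="")
```

Whatever naming scheme is chosen, the names in the GFF have to match the ones used by the reference FASTA and the wiggle files, otherwise no annotated feature lines up with a chromosome.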

Additionally, I discovered the chromosome names of drMalSylv7.2 were being erroneously renamed according to drMalSylv7.3. Although not the cause of this issue, they would have caused problems later on: https://github.com/matthewwiese/apple-phg/commit/536264e64cf0b9e972ea51b444e87ee00fbcead9
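
A cheap guard against this second kind of mistake is to check a rename map against the assembly it is about to be applied to. The sketch below is an assumption about how one might do that, not what the commit does: `fasta_seq_names` and `unmatched_names` are hypothetical helpers, and the sequence names in the example are placeholders rather than real accessions.

```python
def fasta_seq_names(lines):
    """Return the set of sequence names (first token after '>') in FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def unmatched_names(name_map, fasta_lines):
    """Return, sorted, the keys of name_map that do not occur in the FASTA."""
    return sorted(set(name_map) - fasta_seq_names(fasta_lines))


# Placeholder names: the map expects seqA and seqB, the assembly only has seqA.
name_map = {"seqA": "chr1", "seqB": "chr2"}
fasta = [">seqA some description\n", "ACGT\n", ">seqC\n", "GGCC\n"]
missing = unmatched_names(name_map, fasta)
print(missing)  # ['seqB']
```

A non-empty result means the map was probably built for a different assembly, so it is worth stopping there before renaming anything.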