ucl-pathgenomics / HaROLD

Haplotype Reconstruction of Longitudinal Deep sequencing data
MIT License
6 stars 1 forks source link

Exception in thread "main" java.lang.NegativeArraySizeException #7

Open Al313 opened 9 months ago

Al313 commented 9 months ago

Describe the bug

When trying to create the "strandcount" tables from bam files I get the following error:

./my-data/step0-input/iii-13-13MT2EXPIIIVP10FullRepseq25052018_S1_L001-sorted.bam.
Exception in thread "main" java.lang.NegativeArraySizeException: -449
    at makereadcount.Read.<init>(Read.java:39)
    at makereadcount.MakeReadCount.readFromFile(MakeReadCount.java:447)
    at makereadcount.MakeReadCount.readData(MakeReadCount.java:427)
    at makereadcount.MakeReadCount.run(MakeReadCount.java:62)
    at makereadcount.MakeReadCount.main(MakeReadCount.java:44)

Here's the code that I run for this step:

for i in $(ls ./my-data/step0-input); do echo ${I} && java -cp ./lib/htsjdk-unspecified-SNAPSHOT.jar:./lib/picocli-4.1.2.jar:./lib/pal-1.5.1.jar:./lib/cache2k-all-1.0.2.Final.jar:./lib/commons-math3-3.6.1.jar:./jar/MakeReadCount.jar makereadcount.MakeReadCount ./my-data/step0-input/${i}; done

Additional context

  1. The same command works for example bam files provided within the package.
  2. My bam files were created using BWA aligner with default parameters.
cristina86cristina commented 9 months ago

Thanks for the message! Can I have a few more details about the bam file? Which virus you are working on and which reference have you used for mapping? Also did you run the first step with samtools prior trying the makereadcount command?

On Mon, Dec 11, 2023 at 7:02 PM Ali @.***> wrote:

Describe the bug

When trying to create the "strandcount" tables from bam files I get the following error:

./my-data/step0-input/iii-13-13MT2EXPIIIVP10FullRepseq25052018_S1_L001-sorted.bam. Exception in thread "main" java.lang.NegativeArraySizeException: -449 at makereadcount.Read.(Read.java:39) at makereadcount.MakeReadCount.readFromFile(MakeReadCount.java:447) at makereadcount.MakeReadCount.readData(MakeReadCount.java:427) at makereadcount.MakeReadCount.run(MakeReadCount.java:62) at makereadcount.MakeReadCount.main(MakeReadCount.java:44)

Here's the code that I run for this step:

for i in $(ls ./my-data/step0-input); do echo ${I} && java -cp ./lib/htsjdk-unspecified-SNAPSHOT.jar:./lib/picocli-4.1.2.jar:./lib/pal-1.5.1.jar:./lib/cache2k-all-1.0.2.Final.jar:./lib/commons-math3-3.6.1.jar:./jar/MakeReadCount.jar makereadcount.MakeReadCount ./my-data/step0-input/${i}; done

Additional context

  1. The same command works for example bam files provided within the package.
  2. My bam files were created using BWA aligner with default parameters.

— Reply to this email directly, view it on GitHub https://github.com/ucl-pathgenomics/HaROLD/issues/7, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNFCSXGELEOKWPP2PVVIWTYI5KDRAVCNFSM6AAAAABAQICB2GVHI2DSMVQWIX3LMV43ASLTON2WKOZSGAZTMMZUGIYTAMI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Al313 commented 9 months ago

Thank you very much for the swift response.

We have done paired-end short-read Illumina sequencing of HIV-1 genome as part of a long-term experimental evolution. Currently there are more than 50 longitudinal sequencing data for each of our 8 independent passaging lines.

The reference sequence that I used is the consensus sequence of the plasmid used to produce the ancestral virus (HIV-1 NL4-3) in fasta format. The command that I use for mapping is as follows (it's a snippet of my snakemake pipeline):

rule all:
    input:
        expand("/home/amovas/data/scrap/test-harold/HaROLD/my-data/step0-input/{experiment}-{line}-{sample}-sorted.bam", zip,experiment=EXP, line=LINES, sample=SAMPLES)

rule bwa_map:
    input:
        ref="/home/amovas/data/genome-evo-proj/data/reference/plasmid/hiv_plasmid_ref_genome.fasta", 
        read1="/home/amovas/data/genome-evo-proj/data/freezed-raw-data/fastq/{experiment}/{line}/{sample}_R1_001.fastq.gz",
        read2="/home/amovas/data/genome-evo-proj/data/freezed-raw-data/fastq/{experiment}/{line}/{sample}_R2_001.fastq.gz"
    output:
        temp("/home/amovas/data/scrap/test-harold/HaROLD/my-data/step0-input/{experiment}-{line}-{sample}.bam.tmp")
    threads: 1
    shell:
        "(bwa mem -t {threads} {input.ref} {input.read1} {input.read2} | "
        "samtools view -Sb - > {output})"

rule samtools_sort:
    input:
        "/home/amovas/data/scrap/test-harold/HaROLD/my-data/step0-input/{experiment}-{line}-{sample}.bam.tmp"
    output:
        "/home/amovas/data/scrap/test-harold/HaROLD/my-data/step0-input/{experiment}-{line}-{sample}-sorted.bam"
    shell:
        "samtools sort {input} > {output}"

Please let me know if it would help if I share one sequencing sample + reference sequence to pinpoint the problem (unfortunately sequencing files are a bit larger than GitHub size limit to share here).

Thanks in advance.

cristina86cristina commented 9 months ago

Thanks for the info. I think you need to run the first step in the README file under "prepare input files". Even if not necessary depending on the software used for alignment, it is needed for bam files produced with bwa.

samtools view -h -G69 file.bam | samtools view -h -G133 > newfile.bam

This gets rid of unmapped/duplicated multi-aligned reads. You can check the number of mapped reads with samtools in both files:

samtools view -c -F 260 file.bam

samtools view -c -F 260 newfile.bam

This should give you the same number.

Let me know if this solves the problem! If not, yes please share a bam file with me and I'll have a look!

Al313 commented 9 months ago

Yes, I did try running "makereadcount" both on original and new bam files but the error persisted.

I have uploaded the reference sequence + sequencing reads of one sample on Google Drive. I will share the link with you via email.