gymrek-lab / LongTR

Tandem repeat genotyping with long reads
GNU General Public License v2.0

terminate called after throwing an instance of 'std::bad_array_new_length' #15

Open HLHsieh opened 3 months ago

HLHsieh commented 3 months ago

Hi Helia,

Thank you again for your suggestions. I applied LongTR to twenty sets of data, and only one set encountered the following issue:

Detected 1 BAM/CRAM files
User-specified read groups for 1 unique samples
Reading region file /scratch/kinfai_root/kinfai0/hsinlun/reference/myDefinedRepeat_LongTR.bed
Region file contains 1 regions

Processing region chr4 190066140 190092504
121 reads overlapped region, of which
    0 were hard clipped
    0 had an 'N' base call
    97 had low MAPQ
    0 had low base quality scores
    20 did not span the STR
    0 did not have a unique mapping
    4 PASSED ALL FILTERS
Phased SNPs add info for 0 out of 4 reads and 0 out of 1 samples
Trimming reads
Generating candidate haplotypes
    TTCCTGGGCATCCCGGGGATCCCAGAGCCGGCCCA GGTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCGGGTTTGCGGGCAGC...ACTGCCATTCTTTCCTGGGCATCCCGGGGATCCCAGAGCCGGCCCAG GTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCG
                                        GGTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCGGGTTTGCGGGCAGC...ACTGCCATTCTTTCCTGGGCATCCCGGGGATCCCAGAGCCGGCCCAG
                                        GGTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCGGGTTTGCGGGCAGC...ACTGCCATTCTTTCCTGGGCATCCCGGGGATCCCAGAGCCGGCCCAG
                                        GGTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCGGGTTTGCGGGCAGC...GAACTGCCATTCCCTAGCCATTCGCGGGTCCAGAGCCGGCGCGTTAA
                                        GGTACCAGCAGGTGGGCCGCCTACTGCGCACGCGCGGGTTTGCGGGCAGC...ACTGCCATTCTTTCCTGGGCATCCCGGGGATCCCAGAGCCGGCCCAG
Added 0 inexact haplotypes generated by POA
Aligning reads to each candidate haplotype
terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length
/var/spool/slurmd.spool/job10211381/slurm_script: line 63: 3269819 Aborted                 (core dumped) $script --bams ${input_dir}/${myseq}.sorted.bam --fasta ${genome} --regions ${predefined} --tr-vcf ${myseq}.vcf.gz --bam-samps ${myseq} --bam-libs ${myseq} --min-mean-qual -1 --min-reads 1 --max-tr-len 500000 --skip-assembly

I have no idea how to fix the error. Any suggestions would be appreciated. P.S. I am still using the previous version.

Best, Hsin

heliziii commented 3 months ago

Hi Hsin,

It is difficult to tell exactly what happened from the log alone, but I see that the repeat is very long (~25k bp), and the error points to a problem with the size of an array being allocated. Could you please share the repeat information? I'll try to genotype a sample at this locus.
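
As a rough check of the numbers quoted from the log, the sketch below parses the `Processing region` line and the read filter summary. The function names are hypothetical and not part of LongTR, and the span is a plain end minus start with no coordinate convention assumed.

```python
import re

# Excerpt copied from the log in the first comment of this thread.
LOG = """\
Processing region chr4 190066140 190092504
121 reads overlapped region, of which
    0 were hard clipped
    0 had an 'N' base call
    97 had low MAPQ
    0 had low base quality scores
    20 did not span the STR
    0 did not have a unique mapping
    4 PASSED ALL FILTERS
"""

def region_span(log):
    """Return (chrom, end - start) from the 'Processing region' line."""
    m = re.search(r"Processing region (\S+) (\d+) (\d+)", log)
    chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
    return chrom, end - start

def read_counts(log):
    """Return (reads overlapping the region, reads that passed all filters)."""
    total = int(re.search(r"(\d+) reads overlapped region", log).group(1))
    passed = int(re.search(r"(\d+) PASSED ALL FILTERS", log).group(1))
    return total, passed

chrom, span = region_span(LOG)
total, passed = read_counts(LOG)
print(f"{chrom}: {span} bp, {passed}/{total} reads passed filters")
```

For this log the span comes out at a little over 26 kb, and only 4 of the 121 overlapping reads reach the alignment step where the crash occurs.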

Best, Helia

HLHsieh commented 3 months ago

Hi Helia,

I was trying to analyze the same repeat across several datasets, but only one dataset encountered an issue. Here is the repeat information:

chr4    190066141 190092504 3300    8   D4Z4

I have also attached the file that is causing the issue for your reference (https://buckeyemailosu-my.sharepoint.com/:f:/g/personal/hsieh_332_buckeyemail_osu_edu/EuD_CyvayNxNlKdmZGlDQdsBgQoBni_UxuZqVm93mzSIzg?e=D1YGAA).

Thank you for your assistance.

Many thanks, Hsin-Lun

heliziii commented 2 months ago

Hi Hsin,

I apologize for the late reply. Would it be possible for you to upload a BAM file containing only the reads aligning to this region? The current BAM files are a bit large to download.
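
One possible way to produce such a subset is `samtools view` with a region argument, which requires the input BAM to be indexed. The sketch below only builds the command line and does not run it. The file names, the helper functions, and the 5 kb flank are hypothetical choices for illustration, not something specified in this thread.

```python
import shlex

def region_string(chrom, start, end, flank=0):
    """Format a samtools-style region, optionally padded on both sides."""
    lo = max(1, start - flank)  # keep the start coordinate positive
    return f"{chrom}:{lo}-{end + flank}"

def subset_command(in_bam, out_bam, region):
    """Build a `samtools view` command writing reads in `region` as BAM."""
    return ["samtools", "view", "-b", "-o", out_bam, in_bam, region]

# Coordinates taken from the BED line posted earlier in the thread.
region = region_string("chr4", 190066141, 190092504, flank=5000)
cmd = subset_command("sample.sorted.bam", "sample.D4Z4.bam", region)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell or passed to `subprocess.run(cmd, check=True)`. The resulting file should then be indexed with `samtools index` before sharing.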

Best, Helia

HLHsieh commented 1 month ago

Hi Helia,

I apologize for missing your reply. I have uploaded the requested file: https://buckeyemailosu-my.sharepoint.com/:f:/g/personal/hsieh_332_buckeyemail_osu_edu/EuD_CyvayNxNlKdmZGlDQdsBgQoBni_UxuZqVm93mzSIzg?e=KJbvMZ

Additionally, I have a few questions about the TR region BED file. Should NUM_COPIES be an integer? Also, how should I define the TR motif? For instance, I want to test a VNTR with 4 copies, but the copies are not all the same length in the reference genome: they range from 46 to 50 bp due to variants.

Thank you, Hsin

heliziii commented 4 weeks ago

Hi Hsin,

Sorry for the delayed reply. I will look into the files asap.

For the region BED, NUM_COPIES doesn't need to be an integer. The TR motif sequence doesn't affect the final output under normal settings.
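
To illustrate why a non-integer NUM_COPIES can arise naturally, here is a minimal sketch that parses the region line posted earlier in this thread. It assumes the columns are chromosome, start, end, motif length, number of copies, and name, and that start and end are both inclusive. The `Region` class and helper names are hypothetical and not part of LongTR.

```python
from dataclasses import dataclass

@dataclass
class Region:
    chrom: str
    start: int
    end: int
    period: int        # assumed: motif length in bp
    num_copies: float  # parsed as float so fractional values are accepted
    name: str

def parse_region(line):
    """Parse one whitespace-separated region line (column order is assumed)."""
    f = line.split()
    name = f[5] if len(f) > 5 else ""
    return Region(f[0], int(f[1]), int(f[2]), int(f[3]), float(f[4]), name)

def implied_copies(region):
    """Copies implied by span / motif length, assuming inclusive coordinates."""
    return (region.end - region.start + 1) / region.period

r = parse_region("chr4    190066141 190092504 3300    8   D4Z4")
print(r.name, r.num_copies, round(implied_copies(r), 2))
```

Under these assumptions the D4Z4 span divided by the 3300 bp motif length is just under 8, so the value 8 in the BED line is a rounding of a fractional copy number.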

Best, Helia