gymrek-lab / TRTools

Toolkit for genome-wide analysis of tandem repeats
https://trtools.readthedocs.io/
MIT License

Handling of chrX data with dumpSTR #89

Open mikmaksi opened 4 years ago

mikmaksi commented 4 years ago

Hello,

I encountered the following error when running dumpSTR on a VCF produced by GangSTR that had only chrX calls for 3 samples. Two of the samples are female and have 2 comma-separated values in their REPCN field, while 1 sample is male and has a single value in its REPCN field:

Traceback (most recent call last):
  File "/usr/local/bin/dumpSTR", line 11, in <module>
    load_entry_point('trtools==2.0.4', 'console_scripts', 'dumpSTR')()
  File "/usr/local/lib/python3.6/site-packages/trtools-2.0.4-py3.6.egg/dumpSTR/dumpSTR.py", line 941, in run
    retcode = main(args)
  File "/usr/local/lib/python3.6/site-packages/trtools-2.0.4-py3.6.egg/dumpSTR/dumpSTR.py", line 895, in main
    record = ApplyCallFilters(record, invcf, call_filters, sample_info)
  File "/usr/local/lib/python3.6/site-packages/trtools-2.0.4-py3.6.egg/dumpSTR/dumpSTR.py", line 597, in ApplyCallFilters
    filter_reasons = FilterCall(sample, call_filters)
  File "/usr/local/lib/python3.6/site-packages/trtools-2.0.4-py3.6.egg/dumpSTR/dumpSTR.py", line 544, in FilterCall
    if cfilt(sample) is not None: reasons.append(cfilt.GetReason())
  File "/usr/local/lib/python3.6/site-packages/trtools-2.0.4-py3.6.egg/dumpSTR/filters.py", line 605, in __call__
    ml = [int(item) for item in sample["REPCN"]]
TypeError: 'int' object is not iterable

Thanks so much!
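
The last frame of the traceback suggests that `sample["REPCN"]` arrives as a bare `int` for the male sample, which has a single REPCN value, while the female samples yield a sequence. The following is a minimal sketch of that mismatch and of one way to normalize it. It is not TRTools code: `as_tuple` and the sample values are hypothetical and exist only for illustration.

```python
def as_tuple(value):
    """Return a FORMAT value as a tuple, whether it arrived as a scalar or a sequence."""
    if value is None:
        return ()
    if isinstance(value, (list, tuple)):
        return tuple(value)
    return (value,)


# Hypothetical per-sample REPCN values (assumed shapes, based on the traceback).
female_repcn = [12, 15]  # two alleles on chrX, written as "12,15" in the VCF
male_repcn = 12          # one allele on chrX, written as "12" in the VCF

# Iterating over male_repcn directly would raise:
#   TypeError: 'int' object is not iterable
female_lengths = [int(item) for item in as_tuple(female_repcn)]
male_lengths = [int(item) for item in as_tuple(male_repcn)]
```

Wrapping the value before iterating keeps the list comprehension from the traceback working for both one-allele and two-allele calls.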

LiterallyUniqueLogin commented 4 years ago

I need some more information to help you:

mikmaksi commented 4 years ago

Of course

  1. I moved sample data and a launch script here: /storage/mikhail/062620_dumpSTR_chrX_tshoot
    • 1_filter_with_dumpSTR.sh: launcher script
    • data/raw/chrX.vcf: input data
  2. The command used to run dumpSTR is in 1_filter_with_dumpSTR.sh
  3. I used the default TRTools installation available on snorlax

seboyden commented 3 years ago

I routinely get the same dumpSTR error message, using gangSTR v2.4.6 and dumpSTR v3.0.2 (as well as with earlier versions) to test 37 known pathogenic STR loci, 5 of which are on chrX. If I remove the chrX variants from the gangSTR output before providing it to dumpSTR, I do not get the error. I would also vote for handling of chrX by dumpSTR.

seboyden commented 3 years ago

Now trying on gangSTR v2.5.0 and dumpSTR v4.0.0, running on 3 samples with 2 males and 1 female, and where gangSTR output includes calls on chrX, I get the following error from dumpSTR:

Traceback (most recent call last):
  File "~/bin/dumpSTR", line 33, in <module>
    sys.exit(load_entry_point('trtools==4.0.0', 'console_scripts', 'dumpSTR')())
  File "~/lib/python3.6/site-packages/trtools-4.0.0-py3.6.egg/trtools/dumpSTR/dumpSTR.py", line 1245, in run
    retcode = main(args)
  File "~/lib/python3.6/site-packages/trtools-4.0.0-py3.6.egg/trtools/dumpSTR/dumpSTR.py", line 1183, in main
    record = ApplyCallFilters(record, call_filters, sample_info, invcf.samples)
  File "~/lib/python3.6/site-packages/trtools-4.0.0-py3.6.egg/trtools/dumpSTR/dumpSTR.py", line 569, in ApplyCallFilters
    filt_output = filt(record)
  File "~/lib/python3.6/site-packages/trtools-4.0.0-py3.6.egg/trtools/dumpSTR/filters.py", line 706, in __call__
    ci = np.stack(ci)
  File "<__array_function__ internals>", line 6, in stack
  File "~/lib/python3.6/site-packages/numpy-1.18.1-py3.6-linux-x86_64.egg/numpy/core/shape_base.py", line 426, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

If I delete the chrX variants from the GangSTR output, then DumpSTR runs fine.

Incidentally, Fragile X Syndrome and Kennedy disease are 2 common X-linked repeat expansion disorders, so it would be nice to be able to find these by gangSTR & dumpSTR.

rckeerthivasan commented 3 years ago

I am getting a similar error with chrX. If I remove chrX from the VCF file, dumpSTR runs smoothly. If I include it, I get this error:

Traceback (most recent call last):
  File "/project/jcreminslab/kenrc_projects/softwares.2/TRTools/venv.1/bin/dumpSTR", line 8, in <module>
    sys.exit(run())
  File "/project/jcreminslab/kenrc_projects/softwares.2/TRTools/venv.1/lib/python3.6/site-packages/trtools/dumpSTR/dumpSTR.py", line 1245, in run
    retcode = main(args)
  File "/project/jcreminslab/kenrc_projects/softwares.2/TRTools/venv.1/lib/python3.6/site-packages/trtools/dumpSTR/dumpSTR.py", line 1204, in main
    record.vcfrecord.INFO['HWEP'] = utils.GetHardyWeinbergBinomialTest(allele_freqs, genotype_counts)
  File "/project/jcreminslab/kenrc_projects/softwares.2/TRTools/venv.1/lib/python3.6/site-packages/trtools/utils/utils.py", line 312, in GetHardyWeinbergBinomialTest
    if gt[1] not in allele_freqs.keys():
IndexError: tuple index out of range

I use these commands:

GangSTR --bam input.bam --ref hg38.fa --regions hg38_ver13.bed --out input --bam-samps input --samp-sex M

dumpSTR --vcf input.vcf --out out.5 --gangstr-min-call-DP 5 --gangstr-filter-spanbound-only --gangstr-filter-badCI --gangstr-max-call-DP 1000 --gangstr-min-call-Q 0.6

merenlin commented 2 years ago

I'm having the same error as @rckeerthivasan when there are variants on chrX in males. My settings:

 ganstr_command = "GangSTR"\
                    + " --bam " + bamfile + " " + "--ref " + ref \
                    + " --regions regions/my_panel.tsv"\
                    + " --samp-sex " + patient_sex\
                    + " --bam-samps " + patient\
                    + " --out " + output_dir + "/" + file_name \
                    + " --output-readinfo"\
                    + " --nonuniform"\
                    + " --include-ggl"

  dumpstr_command = "dumpSTR" \
                    + " --vcf " + gzvcf_name\
                    + " --out " + filtered_dir + "/" + file_name\
                    + " --gangstr-min-call-DP  20"\
                    + " --gangstr-max-call-DP  1000"\
                    + " --gangstr-filter-spanbound-only"\
                    + " --gangstr-filter-badCI"\
                    + " --zip"\
                    + " --drop-filtered"

The error:

Traceback (most recent call last):
  File "/home/oxana/.local/bin/dumpSTR", line 8, in <module>
    sys.exit(run())
  File "/home/oxana/.local/lib/python3.8/site-packages/trtools/dumpSTR/dumpSTR.py", line 1245, in run
    retcode = main(args)
  File "/home/oxana/.local/lib/python3.8/site-packages/trtools/dumpSTR/dumpSTR.py", line 1204, in main
    record.vcfrecord.INFO['HWEP'] = utils.GetHardyWeinbergBinomialTest(allele_freqs, genotype_counts)
  File "/home/oxana/.local/lib/python3.8/site-packages/trtools/utils/utils.py", line 312, in GetHardyWeinbergBinomialTest
    if gt[1] not in allele_freqs.keys():
IndexError: tuple index out of range

Do you have any ideas for a workaround before it's fixed?

I'm running it on a tumor sample, so high variation is likely, including in copy number and ploidy.
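
The `IndexError` is raised at `gt[1]`, which suggests that at least one genotype tuple has fewer than two alleles, as would be expected for a male chrX call. Below is a rough sketch of separating such genotypes before a diploid-only test. It assumes that `genotype_counts` maps allele tuples to sample counts, which is inferred from the traceback rather than confirmed, and `split_by_ploidy` is a hypothetical helper, not part of TRTools.

```python
from collections import Counter


def split_by_ploidy(genotype_counts):
    """Separate two-allele genotypes from all others.

    genotype_counts is assumed to map allele tuples to sample counts.
    """
    diploid = Counter()
    other = Counter()
    for gt, count in genotype_counts.items():
        if len(gt) == 2:
            diploid[gt] += count
        else:
            # gt[1] does not exist here, so a diploid-only test must skip these.
            other[gt] += count
    return diploid, other


# Hypothetical counts: one female call with two alleles, two male calls with one.
genotype_counts = {(10, 12): 1, (12,): 2}
diploid, haploid = split_by_ploidy(genotype_counts)
```

Only the `diploid` counter would be safe to pass to code that indexes `gt[1]`. How single-allele calls should contribute to a Hardy-Weinberg test is a separate question that this sketch does not address.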

seboyden commented 2 years ago

Only workaround I have is to delete chrX variants from the gangSTR output before passing to dumpSTR. It's not ideal considering there are several X-linked repeat expansion disorders, but at least it allows dumpSTR to run on the rest of the genome, and also verifies that this is the correct problem. Interesting that we're all getting different errors:

TypeError: 'int' object is not iterable
ValueError: all input arrays must have the same shape
IndexError: tuple index out of range
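
For anyone who wants to script the workaround of removing chrX records before running dumpSTR, here is a minimal sketch using only the Python standard library. It assumes an uncompressed VCF and that the X chromosome is named `chrX` or `X`. The function names and the file names in the usage note are hypothetical.

```python
def drop_contigs(vcf_lines, contigs=("chrX", "X")):
    """Yield header lines unchanged, plus records whose CHROM is not in contigs."""
    skip = set(contigs)
    for line in vcf_lines:
        # Header lines start with '#'; for records, CHROM is the first tab-separated column.
        if line.startswith("#") or line.split("\t", 1)[0] not in skip:
            yield line


def filter_vcf(in_path, out_path, contigs=("chrX", "X")):
    """Copy in_path to out_path, leaving out records on the given contigs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_contigs(src, contigs))
```

For example, `filter_vcf("input.vcf", "input.noX.vcf")` would write a copy without chrX records, and that copy could then be passed to dumpSTR with `--vcf`. A gzipped VCF would need `gzip.open` in text mode instead of `open`. Header lines, including any `##contig` entry for chrX, are kept as they are.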