Quantification throwing error - missing reference?

Michael-Geuenich commented 3 years ago

I'm trying to get the allele specific expression of arcasHLA-quant to work. So far I've used arcasHLA to extract all reads that map to chr6, and genotyped my sample using genotype (from the original arcasHLA repo).

I'm now trying to run the quantification command of arcasHLA-quant. I've tried this by running

arcasHLA quant -o /my_output_folder/ /extracted_fastqs/sample1.extracted.1.fq.gz /extracted_fastqs/sample1.extracted.2.fq.gz

This results in the following error:

Traceback (most recent call last): File "/home/arcasHLA-quant/scripts/quant.py", line 178, in <module> indv_idx = args.ref + '.idx' TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Looking at the quant.py script, it seems to me that this is because I did not supply a --ref flag, and the algorithm then tries to do None + '.idx'.

My first question: what file do I need to provide for the --ref flag and how can I create this?

I'm guessing this should come from the quant_ref command referenced in the readme, however, it appears as though this has been renamed to customize? I've tried running customize instead of quant_ref as described in the readme:

arcasHLA customize ~/arcas/sample1.genotype.json -o ~/arcas/ref

However, this results in the following error:

usage: arcasHLA customize [options] arcasHLA customize: error: unrecognized arguments: /arcas/sample1.genotype.json

/arcas/sample1.genotype.json is the output from my genotyping call.

Can you advise? In particular, I am confused about whether or not I need a reference and what the purpose of this is? Also, if I have multiple samples, do I need to create one reference per sample or one reference for all samples?

davetang commented 3 years ago

I'm also interested in the quant functionality and I have the same problem as @Michael-Geuenich. The extract and genotype steps worked fine.

Also just for your information, when I try to run the customize step, I get the following error:

Traceback (most recent call last):
  File "/home/dtang/src/test/arcasHLA-quant/scripts/customize.py", line 43, in <module>
    from Bio.Alphabet import generic_dna
  File "/home/ngs/Analysis/tools/arcashla_venv/lib/python3.9/site-packages/Bio/Alphabet/__init__.py", line 20, in <module>
    raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

I guess I need to use an older version of Biopython.

tpereachamblee commented 3 years ago

Hello! Thanks for your interest in our tool.

@Michael-Geuenich you'll want to run the "customize" script before the "quant" script, which will construct the missing reference. Use the genotype.json from the "genotype" step:

arcasHLA customize --transcriptome chr6 --genotype sample.genotype.json -o ref

The README will be updated to make this clearer.

@davetang you're correct, this error is caused by a depreciated object in the latest Biopython release. For a quick fix, you can revert to an older version of Biopython (1.77 or earlier) until we can provide a workaround in an upcoming release.

davetang commented 3 years ago

Hi @tpereachamblee,

thank you for the comment. I downgraded my Biopython version and could run the customize subcommand. However, I'm getting the following error after running:

arcasHLA quant --ref ref/my_sample -o test my_sample.extracted.1.fq.gz my_sample.extracted.2.fq.gz

Traceback (most recent call last):
  File "/home/dtang/Analysis/tools/arcasHLA-quant_down/scripts/quant.py", line 300, in <module>
    baf = allele_results[gene]['allele1_count'] / (allele_results[gene]['allele1_count'] + allele_results[gene]['allele2_count'])
ZeroDivisionError: division by zero

May you let me know what the problem could be?

Thanks again, Dave

Michael-Geuenich commented 3 years ago

Hi @tpereachamblee,

thanks for the additional clarification. I've run the command exactly as you specified within the arcasHLA-quant docker container. Unfortunately I am still getting the following error:

Traceback (most recent call last):
  File "/home/arcasHLA-quant/scripts/customize.py", line 299, in <module>
    build_custom_reference(subject, genotype, args.grouping, args.transcriptome, temp)
  File "/home/arcasHLA-quant/scripts/customize.py", line 94, in build_custom_reference
    with open('dat/ref/allele_groups.p','rb') as file:
FileNotFoundError: [Errno 2] No such file or directory: 'dat/ref/allele_groups.p'

Any insights on what the issue could be?

davetang commented 3 years ago

Hi @Michael-Geuenich,

I had the same error because I wasn't running arcasHLA from where I cloned and prepare the repo. I copied the dat directory from the cloned location into where I was executing arcasHLA and that circumvented that error. But then I got the divide by zero error, so maybe it's not advisable to copy the folder over. That said, let me know how it works for you!

Cheers, Dave

tpereachamblee commented 3 years ago

Thank you both for your helpful feedback. I believe @davetang's solution should work for you @Michael-Geuenich the customize script needs to be updated work from outside of the arcasHLA folder (presently it does not).

@davetang as far as your issue running quant, I am still looking into it and will comment here once I think I know what is going on.

roseorenbuch / arcasHLA-quant

Quantification throwing error - missing reference? #6