openvax / varlens

commandline manipulation of genomic variants and NGS reads
Apache License 2.0
19 stars 2 forks source link

Three suggestions for reporting. #8

Closed JPFinnigan closed 8 years ago

JPFinnigan commented 8 years ago

Total Counts. If would be great if output of --varlens-allele-support included a column containing the total count for all alleles at a given locus.

source contig interbase_start interbase_end allele count "CT_TOTAL"
B16.F1_0821 1 164181635 164181636 A 17 58
B16.F1_0821 1 164181635 164181636 T 41 58

REF/ALT It would also be extremely helpful if REF/ALT status were carried forward from the .vcf. Example:

source contig interbase_start interbase_end allele "REF/ALT" count
B16.F1_0821 1 164181635 164181636 A ALT 17
B16.F1_0821 1 164181635 164181636 T REF 41

Deletions When working with .vcf containing deletions, varlens fills in the 'allele' column with a blank. Perhaps it would be better if the results were more consistent with the .vcf reporting of deletions. For example, varlens currently reports as follows:

source contig interbase_start interbase_end allele count
B16.F1_0821 1 120036406 120036407 182
B16.F1_0821 1 120036406 120036407 C 1
B16.F1_0821 1 120036406 120036407 T 172
B16.F1_0821 1 120036406 120036407 G 2
B16.F1_0821 1 156690394 156690403 N 13

Whereas the .vcf output for this locus looks like this:

CHROM | POS | ID | REF | ALT |

------------ | ------------- | ------------ | ------------- | ------------ | chr1|120036406|.|GT|G|

The varlens reporting convention for deletions makes it a little difficult to go back to the .vcf and search for the allele of interest, since the ALT allele in varlens "_" differs from the alt allele in the .vcf "G". I know that this is an issue with legacy base- versus the preferred space-counting coordinate systems, but perhaps we can come up with an interim solution in lieu of wider adoption of space-counting coordinate systems in .vcf?

timodonnell commented 8 years ago

Thanks for the suggestions @JPFinnigan . I think I can address the first two points by adding a new command line tool to directly annotate a collection of variants with read counts. I actually need this for some work I'm doing now too. I'll take a stab at it tomorrow.

The third point is a bit tougher as varcode automatically drops the shared prefix of the ref and alt alleles. The 'GT>G' style for deletions is also confusing for read counting since it's not clear how to treat a read that has a deletion at that position (120036407) but with a different base preceding it (e.g. an A at position 120036406 instead of a G). I'll look into adding an option for this though if it's not too complicated.

timodonnell commented 8 years ago

This needs some more testing and documentation, but the varlens-variants tool now supports counting reads that match the ref / alt / total. Here's an example:

$ varlens-variants \
   --include-read-evidence \
   --variants test/data/CELSR1/vcfs/vcf_1.vcf \
   --reads test/data/CELSR1/bams/bam_0.bam \
   --variant-genome b37 \
   --out /tmp/results.csv

This gives csv results like this:

genome,contig,interbase_start,interbase_end,ref,alt,num_alt,num_ref,total_depth
GRCh37,22,21829554,21829555,T,G,0.0,0.0,0.0
GRCh37,22,46931059,46931060,A,C,0.0,50.0,50.0
GRCh37,22,46931061,46931062,G,A,0.0,51.0,51.0
GRCh37,22,50636217,50636218,A,C,0.0,0.0,0.0
GRCh37,22,50875932,50875933,A,C,0.0,0.0,0.0

Other features:

See varlens-variants -h if you're interested in any of these.

Interested to hear how this works for you if you end up using it.

timodonnell commented 8 years ago

@JPFinnigan did this address your issue? Let me know if there's anything else you need

JPFinnigan commented 8 years ago

Hey Tim,

I'll talk a look ASAP, and then get back to you. Thanks for looking into this!

JPFinnigan commented 8 years ago

Hey Tim,

Finally got around to using this. Ran varlens-variants w/ the following input:

JPF-MBP:varlens johnfinnigan$ varlens-variants \
> --include-read-evidence \
> --reads ~/Desktop/TEMP/Results/RNA/Tumor_B16.F10/ISMMS/Tumor_B16.F10_0810.127A/BAM/HISAT2/Tumor_B16.F10_0810.127A.HISAT2.sorted.bam \
> --reads ~/Desktop/TEMP/Results/RNA/Tumor_B16.F10/ISMMS/Tumor_B16.F10_0810.131B/BAM/HISAT2/Tumor_B16.F10_0810.131B.HISAT2.sorted.bam \
> --variants ~/Desktop/TEMP/Results/WES/Tumor_B16.F10_0810/ISMMS/VCF/MuTect/Tumor_B16_F10_0810.mutect.targets.pass.vcf \
> --variants ~/Desktop/TEMP/Results/WES/Tumor_B16.F10_0810/ISMMS/VCF/Strelka/results/passed.somatic.snvs.vcf \
> --out ~/Desktop/Tumor.B16.F10.0810.mutect.targets.pass.vcf_Tumor.B16.F10.0810.strelka.passed.snvs.vcf_Tumor.B16.F10.0810.127A.HISAT2.sorted.bam_Tumor.B16.F10.0810.131B.HISAT2.sorted.bam_varlens.csv

Received the following error message:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/varlens-variants", line 9, in <module>
    load_entry_point('varlens', 'console_scripts', 'varlens-variants')()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 547, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2720, in load_entry_point
    return ep.load()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2380, in load
    return self.resolve()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2386, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/johnfinnigan/Desktop/Utilities/Varlens/varlens/varlens/commands/variants.py", line 22, in <module>
    from .. import variant_includes
  File "/Users/johnfinnigan/Desktop/Utilities/Varlens/varlens/varlens/variant_includes.py", line 12, in <module>
    from . import mhc_binding
  File "/Users/johnfinnigan/Desktop/Utilities/Varlens/varlens/varlens/mhc_binding.py", line 19, in <module>
    import topiary
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/__init__.py", line 7, in <module>
    from .predict_epitopes import (
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/predict_epitopes.py", line 20, in <module>
    from .commandline_args import (
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/commandline_args.py", line 185, in <module>
    "netmhc": mhctools.NetMHC,
AttributeError: 'module' object has no attribute 'NetMHC'

I'm thinking this means that varlens is looking for, but not finding NetMHC. Presumably, this is because mhctools was just updated to go from NetMHC to NetMHC3 NetMHC4. Also, this brings up a f/u question--are you sure you want to include MHC binding predictions by default in varlens-variants?

timodonnell commented 8 years ago

Thanks for the bug report @JPFinnigan . As it looks like you determined, this was due to a topiary issue https://github.com/hammerlab/mhctools/pull/48#issuecomment-186895428

It's a reasonable point though that maybe we shouldn't try to import topiary (and thus depend on mhctools) unless the user is specifically asking for variant predictions. Reopening this issue to potentially address that