Added option to read transcript FPKM values from StringTie GTF file

iskandr commented 8 years ago

Implements a long-requested feature (https://github.com/hammerlab/topiary/issues/39) from @JPFinnigan.

Added load_transcript_fpkm_dict_from_gtf, which generates a transcript ID -> FPKM dictionary from a given GTF file (using gtfparse for most of the actual work).
Renamed flags: --rna-gene-fpkm-file becomes --rna-gene-fpkm-tracking-file, and --rna-transcript-fpkm-file becomes --rna-transcript-fpkm-tracking-file.
These renames make room for a new flag: --rna-transcript-fpkm-gtf-file. There's no gene version of this flag since StringTie seems to only estimate transcript-level FPKMs. If someone using a different tool requests it, we can easily extend the code to handle gene IDs as well.
Added a simple unit test for GTF loading.
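
For readers unfamiliar with the StringTie output, here is a minimal sketch of what a transcript ID -> FPKM mapping involves. It is not the code in this PR (which delegates parsing to gtfparse); it uses only the standard library, the function name sketch_transcript_fpkm_dict is hypothetical, and the sample lines are made up for illustration. It assumes the FPKM value appears as an attribute on "transcript" feature rows.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "STRG.1.1";
_ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def sketch_transcript_fpkm_dict(gtf_lines, feature="transcript"):
    """Hypothetical helper: map transcript_id -> FPKM from GTF lines.

    Assumes StringTie-style rows where FPKM is stored in the ninth
    (attribute) column of rows whose feature type is "transcript".
    """
    result = {}
    for line in gtf_lines:
        line = line.strip()
        # Skip blank lines and header comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GTF row has exactly nine tab-separated columns.
        if len(fields) != 9 or fields[2] != feature:
            continue
        attributes = dict(_ATTRIBUTE_PATTERN.findall(fields[8]))
        if "transcript_id" in attributes and "FPKM" in attributes:
            result[attributes["transcript_id"]] = float(attributes["FPKM"])
    return result


# Made-up sample rows, for illustration only.
SAMPLE_GTF_LINES = [
    "# sample header comment",
    "\t".join(["chr1", "StringTie", "transcript", "100", "900", "1000", "+", ".",
               'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "2.5"; FPKM "1.25";']),
    "\t".join(["chr1", "StringTie", "exon", "100", "400", "1000", "+", ".",
               'gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "2.5";']),
    "\t".join(["chr2", "StringTie", "transcript", "50", "700", "1000", "-", ".",
               'gene_id "STRG.2"; transcript_id "STRG.2.1"; cov "0.0"; FPKM "0.0";']),
]

fpkm_by_transcript = sketch_transcript_fpkm_dict(SAMPLE_GTF_LINES)
```

Exon rows are skipped because, in this sketch, only transcript rows are expected to carry an FPKM attribute.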

JPFinnigan commented 8 years ago

Hey Guys,

Not sure if this is helpful, but I thought it might be worth posting in case it helps identify a potential bug. I cloned this branch and tried to run it with the same .gtf Alex used to test the additions to the PR, and got the following error.

Topiary commandline arguments:
Namespace(ic50_cutoff=500.0, json_variant_files=[], maf=[], mhc_alleles='H2-Kb,H2-Db', mhc_alleles_file=None, mhc_epitope_lengths=[8, 9, 10, 11], mhc_predictor='netmhcpan', only_novel_epitopes=False, output_csv='/Users/johnfinnigan/Desktop/TEMP/Results/MTA/Tumor_B16.F1_0821/RNA/Tumor_B16.F1_0821_mutect.targets.pass.vcf.netmhcpan.RNA.csv', output_html=None, padding_around_mutation=None, percentile_cutoff=None, reference_name=None, rna_gene_fpkm_tracking_file=None, rna_min_gene_expression=0.0, rna_min_transcript_expression=0.1, rna_transcript_fpkm_gtf_file='/Users/johnfinnigan/Desktop/TEMP/Results/RNA/Tumor_B16.F1/Tumor_B16.F1_0821.115B/GTF/StringTie/HISAT2/Tumor_B16.F1_0821.115B.HISAT2.sorted.gtf', rna_transcript_fpkm_tracking_file=None, skip_variant_errors=False, variant=[], vcf=['/Users/johnfinnigan/Desktop/TEMP/Results/WES/Tumor_B16_F1_0821/ISMMS/VCF/MuTect/Tumor_B16_F1_0821.mutect.targets.pass.vcf'], wildtype_ligandome_directory=None)
INFO:root:Building MHC binding prediction type for alleles ['H-2-Kb', 'H-2-Db'] and epitope lengths [8, 9, 10, 11]
INFO:root:Skipping allele SLA-1-CHANGDA: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-HB01: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-HB02: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-HB03: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-HB04: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-LWH: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-TPK: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-YC: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-YDL01: Malformed MHC type 1
INFO:root:Skipping allele SLA-1-YTH: Malformed MHC type 1
INFO:root:Skipping allele SLA-2-YDL02: Malformed MHC type 2
INFO:root:Skipping allele SLA-3-CDY: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-HB01: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-LWH: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-TPK: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-YC: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-YDL: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-YDY01: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-YDY02: Malformed MHC type 3
INFO:root:Skipping allele SLA-3-YTH: Malformed MHC type 3
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/topiary", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/scripts/topiary", line 64, in <module>
    main()
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/scripts/topiary", line 46, in main
    epitopes = predict_epitopes_from_args(args)
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/predict_epitopes.py", line 278, in predict_epitopes_from_args
    transcript_expression_dict = rna_transcript_expression_dict_from_args(args)
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/commandline_args.py", line 304, in rna_transcript_expression_dict_from_args
    args.rna_transcript_fpkm_tracking_file)
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/topiary/topiary/rna/gtf.py", line 47, in load_transcript_fpkm_dict_from_gtf
    column_converters={fpkm_column_name: float})
  File "/Users/johnfinnigan/Desktop/Utilities/Topiary/gtfparse/gtfparse/gtfparse/read_gtf.py", line 58, in read_gtf_as_dict
    if not exists(filename):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 18, in exists
    os.stat(path)
TypeError: coercing to Unicode: need string or buffer, NoneType found

My command line input:

JPF-MBP:~ johnfinnigan$ python /Library/Frameworks/Python.framework/Versions/2.7/bin/topiary \
> --vcf ~/Desktop/TEMP/Results/WES/Tumor_B16_F1_0821/ISMMS/VCF/MuTect/Tumor_B16_F1_0821.mutect.targets.pass.vcf \
> --mhc-predictor netmhcpan \
> --mhc-alleles H2-Kb,H2-Db \
> --mhc-epitope-lengths 8,9,10,11 \
> --ic50-cutoff 500 \
> --rna-transcript-fpkm-gtf-file ~/Desktop/TEMP/Results/RNA/Tumor_B16.F1/Tumor_B16.F1_0821.115B/GTF/StringTie/HISAT2/Tumor_B16.F1_0821.115B.HISAT2.sorted.gtf \
> --rna-min-transcript-expression 0.1 \
> --output-csv ~/Desktop/TEMP/Results/MTA/Tumor_B16.F1_0821/RNA/Tumor_B16.F1_0821_mutect.targets.pass.vcf.netmhcpan.RNA.csv

Any ideas?

iskandr commented 8 years ago

Sorry @JPFinnigan, I was passing the wrong filename to the GTF parser. Try again?
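
For context, the traceback above shows load_transcript_fpkm_dict_from_gtf being called with args.rna_transcript_fpkm_tracking_file, which was None in that run. A rough sketch of the intended dispatch follows. The flag names come from this PR; the function names, the order of the checks, and the return shape are assumptions for illustration, not the actual topiary code.

```python
import argparse


def build_sketch_parser():
    """Hypothetical parser exposing only the two transcript-level flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--rna-transcript-fpkm-tracking-file", default=None)
    parser.add_argument("--rna-transcript-fpkm-gtf-file", default=None)
    return parser


def sketch_transcript_expression_source(args):
    """Return (kind, path) for whichever expression file was given, else None."""
    if args.rna_transcript_fpkm_tracking_file:
        return ("tracking", args.rna_transcript_fpkm_tracking_file)
    if args.rna_transcript_fpkm_gtf_file:
        # The GTF branch must pass the GTF path. Passing the tracking-file
        # argument here would hand None to the GTF parser.
        return ("gtf", args.rna_transcript_fpkm_gtf_file)
    return None


sketch_args = build_sketch_parser().parse_args(
    ["--rna-transcript-fpkm-gtf-file", "sample.gtf"])
source = sketch_transcript_expression_source(sketch_args)
```

Returning None when neither flag is set lets the caller skip expression filtering instead of passing a missing path down to the file loader.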

tavinathanson commented 8 years ago

Looks good to me, modulo minor questions. @JPFinnigan good find, I wouldn't have caught that error in my review.

tavinathanson commented 8 years ago

@iskandr LGTM

Added option to read transcript FPKM values from StringTie GTF file #40