mskcc / tempo

CCS research pipeline to process WES and WGS TN pairs
https://cmotempo.netlify.com/

create_metadata_file.py is filtering out `Splice_Region` #804

Closed anoronh4 closed 4 years ago

anoronh4 commented 4 years ago

The create_metadata_file.py script in the MetaDataParser process uses the number of records in the MAF file to report tumor mutational burden (TMB), but it currently filters out Splice_Region, probably because that value is not in the official MAF spec. Instead it accepts Splice_Site, which is a standard value and, to my understanding, refers to the same thing.

This line filters the MAF file: https://github.com/mskcc/tempo/blob/master/containers/metadataparser/create_metadata_file.py#L283. One solution is to change the vcf2maf package (since Splice_Region is considered non-standard anyway). Alternatively, we could just replace Splice_Region with Splice_Site, or add Splice_Region to the values accepted by the filter when calculating TMB.
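A minimal sketch of the second and third options, assuming the TMB count is simply the number of MAF records whose Variant_Classification is in an accepted set. The names `TMB_CLASSES`, `SPLICE_ALIASES`, and `count_tmb_records` are hypothetical, the accepted set shown here is illustrative rather than the one in create_metadata_file.py, and the sketch uses the standard csv module instead of pandas to stay dependency-free. It also takes the premise above (that Splice_Region can be treated as Splice_Site) as given.

```python
import csv

# Hypothetical alias table: map the non-standard label onto the standard one.
SPLICE_ALIASES = {"Splice_Region": "Splice_Site"}

# Illustrative set of classifications counted toward TMB (an assumption;
# the real list lives in create_metadata_file.py and may differ).
TMB_CLASSES = frozenset({
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Splice_Site",
})


def count_tmb_records(maf_handle):
    """Count MAF records that contribute to TMB after normalizing splice labels.

    `maf_handle` is any iterable of text lines (for example an open file).
    Lines starting with '#' (such as the '#version' header) are skipped.
    """
    data_lines = (line for line in maf_handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    count = 0
    for row in reader:
        label = row["Variant_Classification"]
        # Treat Splice_Region as Splice_Site before applying the filter.
        label = SPLICE_ALIASES.get(label, label)
        if label in TMB_CLASSES:
            count += 1
    return count
```

Adding "Splice_Region" directly to `TMB_CLASSES` would have the same effect on the count without rewriting the label.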

vigneshravi commented 4 years ago

Tried renaming "Splice_Region" to "Splice_Site" in the MAF and re-ran the script, but it still returns the same error: No columns to parse from file

/juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/TempoMegatron/containers/metadataparser/create_metadata_file.py \
    --sampleID SU2LC_MSK_1089_T__SU2LC_MSK_1089_N \
    --tumorID SU2LC_MSK_1089_T \
    --normalID SU2LC_MSK_1089_N \
    --facetsPurity_out ../../../TempoMegatron/results/somatic/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N/facets/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N/facets0.5.14c100pc500/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N_purity.out \
    --facetsQC /juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/TempoMegatron/results/somatic/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N/facets/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N/facets0.5.14c100pc500/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N.qc.txt \
    --MSIsensor_output /juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/work/43/9a63b7a3b82cfa11efe0d5d67bcd1a/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N.msisensor.tsv \
    --mutational_signatures_output /juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/work/4f/9538da3636c2d72918ce611177ec2d/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N.mutsig.txt \
    --polysolver_output /juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/work/f6/356aee770f90a453aba46889d9c9c4/SU2LC_MSK_1089_N.hla.txt \
    --MAF_input /juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/TempoMegatron/results/somatic/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N/combined_mutations/SU2LC_MSK_1089_T__SU2LC_MSK_1089_N.somatic.final.maf \
    --coding_baits_BED /juno/work/taylorlab/cmopipeline/mskcc-igenomes/grch37/coding_regions/AgilentExon_51MB_b37_v3_baits.coding.sorted.merged.bed

Traceback (most recent call last):
  File "/juno/work/ccs/ravichav/Hellman_Exomes/IlluminaExome_38MB/TempoMegatron/containers/metadataparser/create_metadata_file.py", line 191, in <module>
    resultdf = pd.read_csv(resulting_intersection.fn, sep="\t", header=None)
  File "/work/offit/Programz/AnacondaForPy27/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/work/offit/Programz/AnacondaForPy27/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/work/offit/Programz/AnacondaForPy27/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/work/offit/Programz/AnacondaForPy27/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/work/offit/Programz/AnacondaForPy27/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

As a secondary check, I ran bedtools intersect between the Agilent BED file and the MAF. The output contains two matching records, which are SILENT and INTRON.

anoronh4 commented 4 years ago

@vigneshravi This is likely because, even with the additional variant, the MAF still did not intersect with any coding regions, so the Python script was still unable to parse the empty result. That is the core issue behind your error. I will open a new issue for that problem.
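One possible way to tolerate an empty intersection is to catch pandas.errors.EmptyDataError around the read_csv call seen in the traceback and fall back to an empty DataFrame. This is only a sketch under that assumption: the helper name `read_intersection` is hypothetical, and how the script should report TMB when no variants overlap the coding regions is left to the follow-up issue.

```python
import pandas as pd
from pandas.errors import EmptyDataError


def read_intersection(path):
    """Read a headerless, tab-separated intersection file.

    Returns an empty DataFrame when the file has no content, instead of
    letting pandas raise EmptyDataError ("No columns to parse from file").
    """
    try:
        return pd.read_csv(path, sep="\t", header=None)
    except EmptyDataError:
        # No variants overlapped the BED regions; report zero rows.
        return pd.DataFrame()
```

A caller could then use `len(read_intersection(path))` as the number of overlapping records, which is 0 for an empty file.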