Parser error, KeyError: 'NORMAL'

MiguelJulia commented 8 years ago

Hi,

It is my first time using PhyloWGS, so maybe I am doing something wrong. The problem is with the parser, when transforming vcf files to input format. I have tried running the example with my standard vcf 4.2 (as there is no sample.vcf file included to check the format) and always get the same error:

$ ./create_phylowgs_inputs.py -s 5000 sanger=Klon3.vcf Traceback (most recent call last): File "create_phylowgs_inputs.py", line 917, in main() File "create_phylowgs_inputs.py", line 882, in main variant_ids, ref_read_counts, total_read_counts = parse_variants(args, vcf_types) File "create_phylowgs_inputs.py", line 817, in parse_variants parsed_variants.append(variant_parser.list_variants()) File "create_phylowgs_inputs.py", line 26, in list_variants ref_reads, total_reads = self._calc_read_counts(variant) File "create_phylowgs_inputs.py", line 107, in _calc_read_counts normal = variant.genotype('NORMAL') File "/usr/local/lib/python2.7/dist-packages/PyVCF-0.6.8-py2.7-linux-x86_64.egg/vcf/model.py", line 277, in genotype return self.samples[self._sample_indexes[name]] KeyError: 'NORMAL'

I have tried without specifying the -s command, filtering the data to create a vcf with only the SNVs, using an annotated vcf created by SnpEff and/or AnnoVar, using a multisample vcf (I read in the questions section that it is not supported, but tried to check if the error was the same), specifying the python version to use running it with python2 and even running it as root. I am out of ideas.

Also, I want to study the phylogeny of a bunch of samples all together, but in the README file of PhyloWGS there is only an example with one input file "ssm_data.txt". That input file is supposed to be generated by the parser from the vcf of one sample only, as it doesn't support multisample vcf files. How can I run more than one file at the same time? And another technical question: Does PhyloWGS works only with non synonymous SNVs, all kind of SNVs or also with every kind of variant (ie INDELs) called in the vcf file?

Thanks,

Miguel

quaidmorris commented 8 years ago

Hi Miguel, Thanks for your question. Would you mind including a short VCF file that reproduces the error?

jwintersinger commented 8 years ago

Hi @MiguelJulia,

The problem is that whatever variant caller you used is not (yet) supported by the parser. Different callers provide read counts in different ways, so we need to support each format separately. If you post a few lines from your sample VCF, I can help you implement support for your format.

MiguelJulia commented 8 years ago

Hi,

This is the first part of one of the vcf files: https://www.dropbox.com/s/mn5hpbxsq2cten6/Klon3.snpeff.phylowgstoparse.PASS.head.vcf?dl=0

As I said, I have tried different versions of the vcf files. This particular one is the one with more information, and was created with mpileup and annotated with snpeff and annovar. I have tried modifing it and removing some kind of variants to check if that was the problem, but it didn't fix it. Shall I remove the mitochondrial and sex chromosomes before creating the input files? And every variant which is not a SNV?

When running the whole file, I get the error explained before:

$ python2 create_phylowgs_inputs.py -s 5000 sanger=Klon3.snpeff.phylowgstoparse.PASS.vcf
Traceback (most recent call last): File "create_phylowgs_inputs.py", line 917, in main() File "create_phylowgs_inputs.py", line 882, in main variant_ids, ref_read_counts, total_read_counts = parse_variants(args, vcf_types) File "create_phylowgs_inputs.py", line 817, in parse_variants parsed_variants.append(variant_parser.list_variants()) File "create_phylowgs_inputs.py", line 26, in list_variants ref_reads, total_reads = self._calc_read_counts(variant) File "create_phylowgs_inputs.py", line 107, in _calc_read_counts normal = variant.genotype('NORMAL') File "/usr/local/lib/python2.7/dist-packages/PyVCF-0.6.8-py2.7-linux-x86_64.egg/vcf/model.py", line 277, in genotype return self.samples[self._sample_indexes[name]] KeyError: 'NORMAL'

But with this smaller part I get this other one:

$ python2 create_phylowgs_inputs.py -s 5000 sanger=Klon3.snpeff.phylowgstoparse.PASS.head.vcf [12:10PM] Traceback (most recent call last): File "create_phylowgs_inputs.py", line 917, in main() File "create_phylowgs_inputs.py", line 882, in main variant_ids, ref_read_counts, total_read_counts = parse_variants(args, vcf_types) File "create_phylowgs_inputs.py", line 817, in parse_variants parsed_variants.append(variant_parser.list_variants()) File "create_phylowgs_inputs.py", line 23, in list_variants variants = self._filter(self._vcf_filename) File "create_phylowgs_inputs.py", line 68, in _filter all_variants = self._parse_vcf(vcf_filename) File "create_phylowgs_inputs.py", line 36, in _parse_vcf for variant in vcfr: File "/usr/local/lib/python2.7/dist-packages/PyVCF-0.6.8-py2.7-linux-x86_64.egg/vcf/parser.py", line 547, in next pos = int(row[1]) ValueError: invalid literal for int() with base 10: '##FILTER=<ID=PASS,Description="All'

Once this is working and I get the input files, how I am supposed to proceed? As I said before, I have a bunch of samples I want to analyse together, but multisample VCFs files are not supported (or at least I read that in other questions).

Thanks a lot,

Miguel

quaidmorris commented 8 years ago

Hi @MiguelJulia

It looks like you are telling PhyloWGS that your VCF file was generated by Sanger, i.e. Caveman. As Jeff indicated, the parser needs to know the number of reads at the locus that support the reference allele, and the number that support the variant allele. It wasn't clear to me from looking at your VCF format what those fields are. Maybe you have some insight? Unfortunately, it seems like every variant caller reports this information differently in their VCF which makes it difficult to support an arbitrary caller.

Best, Quaid.

MiguelJulia commented 8 years ago

Hi @quaidmorris

To find the number of reads in my VCF, look into the INFO field. It looks like this example:

DP=45;VDB=0.652204;SGB=-0.686358;RPB=0.538329;MQB=0.262033;MQSB=0.842187;BQB=0.321882;MQ0F=0.0444444;ICB=1;HOB=0.5;AC=1;AN=2;DP4=13,2,9,5;MQ=55;EFF=missense_variant(MODERATE|MISSENSE|Ggg/Agg

Here you have all the information you could need, in particular: DP = The number of reads covering or bridging POS. DP4 = Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.

If it can help you for making this variant caller supported for your parser in a future release, the parser I used is samtools mpileup, and all the meaning of the fields is in this link: http://samtools.sourceforge.net/mpileup.shtml Some extra subfields may have been added later in the INFO field when using SnpEff and Annovar to annotate it.

Thanks,

Miguel

jwintersinger commented 8 years ago

Hi @MiguelJulia,

If you pull the latest revision of PhyloWGS from GitHub, you should be able to parse your file, as I made some slight changes to our DKFZ VCF parser to support your format.

The example VCF file you posted is actually invalid because each line is prefixed by the line number. In addition, the VCF parsing library we're using doesn't like some of the metadata lines that start with ##, so it is sometimes necessary to remove these. To correct both these issues, run this command:

cat example.vcf | cut -f 2- | grep -v '^##' > fixed.vcf

Now you can run our parser on the fixed VCF, using the dkfz VCF flavour:

python2 create_phylowgs_inputs.py dkfz=fixed.vcf

This will create ssm_data.txt and an empty cnv_data.txt. Of course, to permit accurate phylogeny inference, you should have CNV calls delineating at least what regions are normal CN so that we can ignore variants outside such regions, but this should be enough to get you started.

I'm closing this issue as resolved for the moment, but please let me know if it works!

MiguelJulia commented 8 years ago

Hi @jwintersinger

Thanks a lot, it is working! And sorry for the first column, that shouldn't be there (it is not in the real file, I added it by mistake by using 'cat -n' when extracting the first part of the file).

So, now that the program works with my data, how do I create a phylogeny with all my cancer samples together? As multisample VCF are not supported, I split them in individual VCF files, but don't know how to pass them as arguments to evolve.py.

Also, I am trying to use Battenberg to get my CNVs. Could I add the cnv_data.txt generated form Battenberg output to the ssm_data.txt of my VCF file? Or both files should be created from the same bam file with Battenberg?

Thanks,

Miguel

quaidmorris commented 8 years ago

Hi @MiguelJulia,

Have you had a look at the README for the parser that @jwintersinger wrote?

https://github.com/morrislab/phylowgs/tree/master/parser

I think that it answers your questions. Also, have a look at the discussion of this topic in other issues. You aren't alone in your desire to use PhyloWGS on multi-sample data.

Best, Quaid.

MiguelJulia commented 8 years ago

Hi @quaidmorris @jwintersinger

I have tried running evolve.py with the single sample ssm_data.txt file produced by the parser and also with a multiple sample file that I did myself. In both situations, the program start running and after a variable amount of time it stops returning the following error:

python2 /home/mj308/phylowgs-master/evolve.py UKF-NB-3_ssm_data.txt UKF-NB-3_cnv_data.txt [ 9:38PM] [2016-06-18 15:34:07.343604] Starting MCMC run... [2016-06-18 15:34:07.344171] -1000 [2016-06-19 02:35:57.241158] Shrinking MH proposals. Now 200.000000 [2016-06-19 10:09:24.540348] -999 [2016-06-19 20:46:41.833730] Shrinking MH proposals. Now 400.000000 [2016-06-20 04:32:05.608706] -998 [2016-06-20 15:21:28.044754] Shrinking MH proposals. Now 800.000000 [2016-06-20 23:28:19.592745] -997 zsh: segmentation fault python2 /home/mj308/phylowgs-master/evolve.py UKF-NB-3_ssm_data.txt

I have tried to change a few options in the execution of the program, but I always get the same error. Do you know why this is happening?

I am monitoring the resources of my machine when executing it and it only uses 1 core and up to 2% of my RAM, and there is lots of free space in the HDD, so it is not problem of lack of resources. Is there any built in limit in the amount of ssm that the program can analyse?

An example of my input data is attached: head_ssm_data.txt

Thanks again,

Miguel

quaidmorris commented 8 years ago

@MiguelJulia, This is multi-sample? I suspect you have very deep sequencing and multi-samples that are inconsistent with one another. Would you mind sharing some examples of your ssm_data.txt file?

MiguelJulia commented 8 years ago

Hi @quaidmorris

I have tried both with the multi-sample file and the single sample obtained as you suggested earlier in this thread, and in both situations I get the same problem. head_multi_ssm_data.txt head_single_ssm_data.txt

quaidmorris commented 8 years ago

@MiguelJulia, I don't see anything in the files that you sent that would cause this problem -- can you reproduce the behaviour with just these files?

I don't know what's causing the segfault, we've never seen that before, but "Shrinking MH proposals" warning is associated with very deeply sequenced SSMs. Or possibly a CNA that covers a very large part of the genome. What's your largest d value? And did you check that 0 <= a <= d for all SSMs?

How many SSMs do you have?

MiguelJulia commented 8 years ago

Hi @quaidmorris

No, I can't replicate it with those small files. The whole multi-sample file contains 79225 ssm and every single sample contains around 40000. That's why I asked if there is any logical limit built in the code.

Largest d value = 1008

0 <= a <= d for all SSMs. The SSMs with no information in some of the samples were calculated as explained in https://github.com/morrislab/phylowgs/issues/8

MiguelJulia commented 8 years ago

Hi @quaidmorris

Is there any logical limit built in the code? Could you discover why I am getting this error?

quaidmorris commented 8 years ago

Hi Miguel, No explicit logical limits but we generally don't run with >5000 SSMs because that's as many as we generally need to define the subclonal lineages. The error you are getting is due to the likelihood function being highly peaked. Generally this happens with high d values but that does not seem to be the case here. We have not tried PWGS with 79k SSMs with a read depth of 1000 or so, in a multiple sample case. Maybe try using just 5k SSMs?

Q

On Friday, July 8, 2016, MiguelJulia notifications@github.com wrote:

Hi @quaidmorris https://github.com/quaidmorris

Is there any logical limit built in the code? Could you discover why I am getting this error?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/morrislab/phylowgs/issues/33#issuecomment-231302812, or mute the thread https://github.com/notifications/unsubscribe/AFGUdoPO3ZKwvkE_vOBW672LOeMOd34cks5qTgiTgaJpZM4IqqUM .

Quaid Morris, PhD Associate Professor, The Donnelly Centre Departments of Molecular Genetics and Computer Science 160 College St, Rm 616 Toronto ON, M5S 3E1 Canada http://morrislab.med.utoronto.ca cell: (416) 220 5796

morrislab / phylowgs

Parser error, KeyError: 'NORMAL' #33