moiexpositoalonsolab / grenepipe

A flexible, scalable, and reproducible pipeline to automate variant calling from raw sequence reads, with lots of bells and whistles.
http://grene-net.org
GNU General Public License v3.0

separate output files if running on multiple files #4

Closed florianhahn87 closed 3 years ago

florianhahn87 commented 3 years ago

Hey Lucas, when we run the pipeline on multiple files, it produces one vcf file which has all our samples tab-separated in the file. We would like to have single, separate output files for each sample though, which would e.g. reduce the number of lines in each sample's vcf file (as not all SNPs are present in each sample) and would help us in downstream analysis. Is there a command which leads to separate output files (such as snakemake --cores 40 --use-conda --separate)? Also, we thought we could just run the pipeline again for each sample to create different output files, by adding only one sample to the sample.tsv file. However, as we already ran the pipeline once on all samples, it still produces an output with all samples in the output file; I guess this has something to do with the pipeline storing all intermediate files? Any suggestions? :) Best, Florian

lczech commented 3 years ago

Hi Florian,

so, grenepipe creates one combined vcf on purpose, because in our typical use case of calling variants for individuals or pools of individuals, we are interested in comparing the samples (columns in the vcf, that is, the calls for each individual or pool) to each other. For this, it makes sense to have them all in one file that you can go through line by line to get all variants at a given position.

You seem to have a different use case, very interesting! Would you mind elaborating on that, that is, why you are interested in samples individually instead of collectively? I guess you don't want to compare them to each other then?

As for how best to do this, I'd suggest using bcftools:

# loop over all sample names (i.e., columns) in the combined vcf
for sample in `bcftools query -l all.vcf.gz`; do
    # -s selects one sample, -c1 keeps only sites with at least one non-reference
    # allele in that sample, and -Oz writes bgzip-compressed vcf output
    bcftools view -c1 -Oz -s $sample -o $sample.vcf.gz all.vcf.gz
done

where you use the all.vcf.gz from grenepipe. This will create individual vcf files for each sample/column. Is that what you are looking for?
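Depending on your downstream tools, you may also want to index each of the resulting per-sample files afterwards, e.g.:

for sample in `bcftools query -l all.vcf.gz`; do
    # create a .tbi index for each per-sample file
    bcftools index --tbi $sample.vcf.gz
done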

Cheers and so long
Lucas

florianhahn87 commented 3 years ago

Hey Lucas, we created different independent mutant plant lines which show unexpected phenotypes, so we want to see for each plant whether these can be explained by tissue-culture-induced mutations. So we want to compare each plant separately to a WT/reference sample.

If I work with the all.vcf file for plants X, Y, Z combined, I could of course just open the vcf file in Excel and copy-paste out the columns I need for plant X. I think, though, that it would make more sense to separate the samples before running them, because then most variants which are specific to plant X would not even show up in the files of the other plants. If I take the combined all.vcf file and then use the column for plant X only, I would first have to sort out a lot of variants again which are like the reference in plant X but maybe not in plant Y or Z, so they end up in the combined vcf file. Not sure if I explain it well, to be honest.

Cheers, Florian

lczech commented 3 years ago

Ah I see, that makes sense. I would not recommend editing vcf files in Excel though... While the bulk of a vcf file is tab-separated, the header is not, and that might mess things up. Rather use tools such as bcftools to manipulate the files.
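For example, to peek at the file or to pull out a plain table of genotypes without touching the header (using the all.vcf.gz name from above), something along these lines should work:

# print only the header lines
bcftools view -h all.vcf.gz

# export CHROM, POS, REF, ALT and each sample's genotype as a plain tsv,
# which is much safer to open in a spreadsheet than the raw vcf
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' all.vcf.gz > all_genotypes.tsv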

In order to exclude unwanted sites from the per-sample files, you can use the above bcftools command as well, and additionally provide some filtering. I'm not totally experienced with that, but looking at their documentation, the view command that I used above can do filtering as well, and it seems that --private is the option that might help here.

So in total:

for sample in `bcftools query -l all.vcf.gz`; do
    # --private keeps only sites where the non-reference allele occurs
    # exclusively in the selected sample
    bcftools view -c1 --private -Oz -s $sample -o $sample.vcf.gz all.vcf.gz
done

Let me know if that works for you :-)

florianhahn87 commented 3 years ago

Amazing, "for sample in bcftools query -l all.vcf.gz; do bcftools view -c1 --private -Oz -s $sample -o $sample.vcf.gz all.vcf.gz ; done" did the job, thanks so much!

lczech commented 3 years ago

Awesome, glad to hear! Do you think that this could be useful for others as well? I could make this an optional last step of the pipeline, in case having individual vcf files is a common requirement.

florianhahn87 commented 3 years ago

Hey Lucas, so I noticed that there were some issues with the bcftools code. It somehow sorted out way too many lines, even though they had alternative alleles. I think the --private option was doing that (I guess because it only picks sites where the selected sample exclusively carries a non-reference allele, which means that if two of my samples carry a variant at a given site, it would not pick that site). Using the same command without that option seems to lead to a result which is closer to what I want: "for sample in `bcftools query -l all.vcf.gz`; do bcftools view -c1 -Oz -s $sample -o $sample.vcf.gz all.vcf.gz ; done"

However, this approach of having a combined file and then splitting it afterwards leads to issues with downstream analysis. Example: a line from the combined vcf of 6 samples:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 15_2_A2 16_2_A2 1_1_A5 1_1_D1 3_10_2 8_1_B4

Chr9 7944706 . AC A,ACC 6279.90 PASS AC=6,4;AF=0.500,0.333;AN=12;BaseQRankSum=1.78;DP=209;ExcessHet=0.2020;FS=3.955;MLEAC=6,4;MLEAF=0.500,0.333;MQ=59.98;MQRankSum=0.00;QD=26.72;ReadPosRankSum=1.65;SOR=0.916 GT:AD:DP:GQ:PL 1/1:0,31,0:31:93:1161,93,0,1161,93,1161 1/1:0,26,0:26:78:974,78,0,974,78,974 2/2:0,0,29:29:87:1021,1021,1021,87,87,0 2/2:0,0,45:45:99:1603,1603,1603,135,135,0 0/0:33,0,0:33:90:0,90,1350,90,1350,1350 1/1:1,42,0:43:99:1558,119,0,1561,126,1568

If I look in my combined vcf file, I get lines like this one, which shows two different variants among my samples. However, each sample contains either the one variant or the other. After the bcftools filtering, the single file for each sample still keeps all the information in the "ALT" and "INFO" columns, which is still a mix of information from all samples.

Example: Same line after filtering only for sample 1_1_A5

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1_1_A5

Chr9 7944706 . AC A,ACC 6279.9 PASS AC=0,2;AF=0.5,0.333;AN=2;BaseQRankSum=1.78;DP=209;ExcessHet=0.202;FS=3.955;MLEAC=6,4;MLEAF=0.5,0.333;MQ=59.98;MQRankSum=0;QD=26.72;ReadPosRankSum=1.65;SOR=0.916 GT:AD:DP:GQ:PL 2/2:0,0,29:29:87:1021,1021,1021,87,87,0

So if I want, e.g., to look for samples with AF=1 in my single-sample files, I would lose a lot of lines even though they might in reality be AF=1, just because the file retains a combined AF value across all samples (not sure if this explanation is clear). So I think the best way would be (at least for my approach) if the whole pipeline provided separate files from the start, with all the information for that specific sample only. Same, e.g., with the "ALT" alleles, where my files for each separate sample now contain "ALT" alleles which are not present in that sample. I will try to sort it out in some other way for the time being, but it might be worth keeping in mind?

Best, Florian

PS: I guess a solution so far would be to run your pipeline separately for each sample only? But for that, I would need to delete all the temp files, because otherwise rerunning a sample gets a bit messy with the previous results from different settings. E.g. if I change my sample.tsv file to contain only sample 16_2_A2 and then rerun grenepipe, I still get a vcf file containing all my samples (and the run takes like 10 seconds, which already tells me that it is not a full new run). So which files would I need to delete to really start a full new run?

lczech commented 3 years ago

Hi Florian,

so, grenepipe is designed as a pipeline for multiple samples - and produces a combined vcf for all of them. Furthermore, calling tools such as freebayes even take information from all the samples into account to make their decision on which variants to call. Hence, the way it works now is by design. I'd be curious to hear from other users though if they have similar requirements as you have - that is, call variants in samples independently. If more have that requirement, I might add a mode for that in the future.

For now, you basically pointed out the two ways to solve your specific requirement: Split the file afterwards, or run the pipeline for each sample separately.

Splitting afterwards seems like the easiest solution, as then you can take full advantage of the power of the pipeline (in terms of scalability in cluster environments, for example). Seeing that bcftools --private was not the right option, you can look for other tools that can do what you want. I guess there are tools with options out there that can do this - if you find one, please report back here for the future :-). Alternatively, if there really is no way to achieve your kind of filtering with existing tools, it shouldn't be too hard to write your own script that does what you want, for example using Python and a vcf parsing library such as VCFPy or PyVCF.
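One bcftools combination that might get closer to what you want (untested on my side, so take it as a sketch based on the bcftools documentation): bcftools view can drop ALT alleles that are unused in the selected sample via -a/--trim-alt-alleles, and the +fill-tags plugin can then recompute INFO tags such as AC, AN, and AF for just that sample:

for sample in `bcftools query -l all.vcf.gz`; do
    # -a (--trim-alt-alleles) removes ALT alleles not seen in the selected sample;
    # +fill-tags then recomputes AC, AN, and AF from the remaining genotypes
    bcftools view -c1 -a -s $sample all.vcf.gz \
        | bcftools +fill-tags - -Oz -o $sample.vcf.gz -- -t AC,AN,AF
done

Other INFO fields that were computed across all samples (QD, FS, and so on) would still be left as they are, though.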

If you rather want to run the pipeline individually on each sample, I would suggest the following approach: Write a script that takes your existing samples.tsv table, and for each line creates a new directory with a sample table containing only that one line (one sample), and a config file that points to that table. Then, still within that script, call grenepipe, using the snakemake option --directory to point to that directory (see the snakemake documentation for details on that option). Snakemake will then run with that config and that sample table that only contains one sample. Pseudo code:

for line in samples.tsv [except for the header line]; do
    make a directory, for example, named after the sample
    write new samples.tsv with just the header and one line to that directory
    write config.yaml to that directory, where the "samples" line is replaced by the path to that new table
    snakemake [other options] --directory [point to the new directory]
done
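As a rough bash sketch of the above (untested; it assumes the sample name is in the first column of samples.tsv, that config.yaml has a top-level "samples:" line pointing to the table, and that you add your usual grenepipe snakemake options where indicated):

#!/bin/bash
# one grenepipe run per sample, each in its own working directory
tail -n +2 samples.tsv | while IFS= read -r line; do
    sample=$(printf '%s\n' "$line" | cut -f1)
    mkdir -p "$sample"
    # new sample table with just the header and this one sample
    { head -n 1 samples.tsv; printf '%s\n' "$line"; } > "$sample/samples.tsv"
    # point the config to the new table (assumes a top-level "samples:" entry)
    sed "s|^samples:.*|samples: \"$PWD/$sample/samples.tsv\"|" config.yaml > "$sample/config.yaml"
    # add your usual options here, e.g. --snakefile pointing at the grenepipe Snakefile
    snakemake --cores 40 --use-conda --directory "$sample"
done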

Does that make sense?

Also, to answer your question about which files you'd need to delete: Snakemake only re-runs steps that are necessary to produce the end result files. If you delete intermediate files, but not the end result files, snakemake will do nothing, because the end result files that it wants to create are still there, and so it decides that nothing needs to be done. It can be tricky to figure out what exactly needs to be deleted for it to re-run certain steps - you'd have to follow the dependency graph for that. Alternatively, have a look at the snakemake command line interface, which also has options to force re-running certain steps. Still, the easiest (but also slowest) option is to just run in a clean directory, as suggested above.
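For reference, these are the kinds of snakemake options I mean (check snakemake --help for your version; the rule name below is just a hypothetical example):

# force re-running everything, ignoring existing output files
snakemake --cores 40 --use-conda --forceall

# force re-running one particular rule and everything downstream of it,
# e.g. a hypothetical rule called genotype_gvcfs
snakemake --cores 40 --use-conda --forcerun genotype_gvcfs

# or delete all output files defined by the workflow, then start a fresh run afterwards
snakemake --cores 40 --use-conda --delete-all-output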

Hope that helps
Lucas

florianhahn87 commented 3 years ago

Hey Lucas, thanks for the extensive reply! Yes, there are of course several ways to split the file into only the columns you need afterwards (I put the bcftools command for that in the previous post; it is the same as yours, just without the --private option), but that will not solve the problem that some of the information is aggregated over all the samples in the combined file (such as the "ALT" and "INFO" columns), and these are difficult and sometimes impossible to dis-aggregate (not sure that is an actual word :D).

I think only running the samples separately will give me unique ALT/INFO columns for each sample, so I will try to rerun everything in a clean directory.

Thanks for your help and support, highly appreciated!

Best, Florian

lczech commented 3 years ago

Hi Florian,

okay, your case is special in the sense that the samples need to be called completely independently of each other, so it makes sense to go with the approach of running grenepipe individually each time. Sounds good to me. Did my pseudo-script above help you to get this running?

Let me know if you have any further questions!
Lucas

florianhahn87 commented 3 years ago

I did it manually as I only had six files :) Still thanks a lot, the script might be useful if I have more files in the future! Cheers!