Closed stephanflemming closed 10 months ago
Hi. Sorry, just saw this now. I'll try to work through this as I have time. I appreciate the effort you put into going through this.
Off the top of my head, test files you find in the src/test folders (Pisces/src/test/AdaptiveGenotyper.Tests/TestData) are not meant to be used as test input for the applications. They are test input for the unit tests. Some of the files are deliberately pathological to test error handling.
For the rest, I need to go through your questions one by one and check current limits / behaviors, etc.
Thanks
GeminiMulti - takes bam. outputs bam Pisces - takes bam. outputs vcf Scylla - takes bams and vcf. Should work with all VCF, gVCF, uncrushed or crushed variant files as input and output. VariantQualityRecalibration - takes bams and vcf. Should work with all VCF, gVCF, uncrushed or crushed variant files as input and output.
The other applications (AdaptiveGenotyper, Psara, ReformatVcf, VennVcf) are very special purpose. I'm not sure how useful they would be to a general audience unless they felt like tinkering with the code.
re 11. What happens if the number of files for --intervalpaths/-i and --bam differ?
Confirmed, you get a nice error msg. ERROR: Multiple IntervalPaths specified, but number does not correspond with number of BAMPaths.
re 12: Can you provide a test file for --forcedalleles, --priorspath and --adaptivegenotypeparameters_fromfile?
Forcedalles input is a path to a standard vcf . priorspath is also just a path to a standard vcf file. The input for adaptivegenotypeparameters_fromfile is a path to a "model file" and there is an example in AdaptiveGenotyper.Tests/TestData/example.model
re 13. I dont think you as a human user ever need to worry about this one. Its exposed because of a cheap trick we once did to parallelize the code. This was a flag the wrapper manipulated.
re 14. --vffilter, --minvariantfrequencyfilter defaults to 0.01f, ie 1%
re 15 I believe that is correct. "crushvcf" and "forcedalleles" dont work well together. Those two mode were meant to be obsoleted in the future (imo, neither is worth maintaining) so the corner case where both were on together was never worked through. The output vcf formating of the two modes run together would be very messy.
minvq|minvariantqscore Is the min quailty score a variant has to achieve in order to be emitted as a variant and not a "no call". By default is set to 20. The execeptional case is a variant that is a "forced allele," which will still be called, even if it below this threshold. These variants may still get tagged as filtered, if they dont make the filter thresholds.
re 16. The -debug option writes some extra lines to the log, but its not intended to be useful to anyone expect a developer. What 'extra stuff' got written to the log varies version to verison, depending on what we were working on at the time. If -debug and -OutputBiasFiles options are used together, you can get a lot of extra files written with information about strand bias and read count. That might be the extra files you are thinking about.
re 17 ha. Yes, you are correct. The data type stated in the help is clearly wrong.
re 18. OK. Thats not the best behavior. I'll fix it when/if I do a re-release. IMO, anything output should go to the output folder.
re 19. I believe it respects and preserves gvcf.
re 20.. OK, not the best log file name. I think the motivation was that Scylla.log caused a problem when multple vcfs and multple instances of scylla were running and writing to the same directory. So we wanted separate logs per vcf.
re 21 - You need to specify what kind of data you want to keep track of, to determine what output files to use. DoBasicChecks will just check for the number of different categories of SNPs, looking to see if something is over-represented and maybe indicating something going wrong the experiment. DoAmpliconPositionChecks does similar but only around a window kicked on by a jump/fall in coverage suggestive of the edge of an amplicon.
if (options.DoBasicChecks) { CountsFileWriter.WriteCountsFile(basicCountsPath, basicCountsData); }
if (options.DoAmpliconPositionChecks) { CountsFileWriter.WriteCountsFile(edgeCountsPath, edgeVariantsCountData); }
re 22 - Was this a test you ran your self? (if yes, what was your command line input?) Or was it an error thrown by one of our canned unit tests? For the error, my guess is that it was trying to perform recalibration, and it wanted a file with the count of variants found in the amplicon edges, to see if they were over represented (thus possibly should have their qscores downgraded by the recalibrator). This amplicon-edge-awareness feature in the recalbrator was one of the last features I was working on before i left the company and its really more of a stub that anything completely developed.
re 23 - that's just the limitations of the parser and nothing to do with science. The parser in the program was tightly coupled to a naming convention and file structure used at illumina. This could be lifted to make it more user friendly.
Thanks again for your efforts with this.
Hi,
we want to integrate the following Pisces tools in Galaxy (PR)
and there are some questions left.
1) Which of these tools take VCF, gVCF, uncrushed or crushed variant files as input and output?
2) Not all min|max|default values for each parameter are mentioned inOptions.used.json files, but it might be a good idea, if you have a look on this https://docs.google.com/spreadsheets/d/1cMTCFptA9eQnMMoCJAbuhU4pQ14lzcgqENNNfs6uG6o/edit?usp=sharing ... I used "?" for missing entries, e.g.
--help
. I collected some information from the wiki and?, ?, 1
means, that (possible) min and max values are not known, but the default value is "1".3) Is there a difference between
<not set>/null
andfalse
for some parameters, e.g.--crushvcf
?AdaptiveGenotyper
4) Is it possible to use more than one model file (
--model/--models
)?5) The output format of "recal variants" (VCF or gVCF) depends on the input format of parameter
--vcf
?6) Can you provide a small test file (VCF) that runs without errors? The test files in
Pisces/src/test/AdaptiveGenotyper.Tests/TestData
cause errors likeor
GeminiMulti
7) Doesn't seem to work with samtools > 1.10.
8) There are some conditionals for the parameter
--stitchonly
and--realignonly
. Is the following order correct?neither
--stitchonly true
nor--realignonly true
--stitchonly true
--realignonly true
9) What are possible values for
--stringtagstokeepfromr1
?10) Do
--categoriestorealign
and--categoriestosnowball
share the same set of possible values? And what are the default values of--categoriestosnowball
?Pisces
11) What happens if the number of files for
--intervalpaths/-i
and--bam
differ?12) Can you provide a test file for
--forcedalleles
,--priorspath
and--adaptivegenotypeparameters_fromfile
?13) What is the effect of
--insidesubprocess true
?14) What is the default value for
--vffilter
,--minvariantfrequencyfilter
?15) Are the following conditionals correct? And are parameters missing here (e.g.
--minvq
, ...)?--forcedalleles <file>
and--crushvcf true
cannot be used together ?--callmnvs true
--collapse true
--ploidy somatic
--ploidy diploid
--ploidy DiploidByAdaptiveGT
16) What is the effect of
--debug true
? I cannot find extra generated files that are mentioned in--help
.17) I think that the following entries of
--help
are wrong.ReformatVcf
18)
*.uncrushed.vcf
result file is not created in--outfolder
but in the folder the input VCF file is located in.19) This subtool creates a crushed VCF and takes an uncrushed VCF as input? What about gVCF here?
Scylla
20) Log file has the same name as the input file (*.vcf).
VariantQualityRecallibration
21)
*.counts
,*.edgevariants
,*.edgecounts
are not always created. On which parameter do they depend on?22) In some tests an error message like the following appeared.
CreateGenomeSize
23) Why is it important, that a genome description fulfills the following check
GSFOptions.SpeciesName.Split(' ').Length < 3)
?Thank you in advance, Stephan