tamsen / Pisces

Somatic and germline variant caller for amplicon data. Recommended caller for tumor-only workflows.
GNU General Public License v3.0
7 stars 0 forks source link

Some questions related to Galaxy integration of Pisces #11

Closed stephanflemming closed 10 months ago

stephanflemming commented 3 years ago

Hi,

we want to integrate the following Pisces tools in Galaxy (PR)

and there are some questions left.

1) Which of these tools take VCF, gVCF, uncrushed or crushed variant files as input and output?

2) Not all min|max|default values for each parameter are mentioned in --help. I collected some information from the wiki and Options.used.json files, but it might be a good idea, if you have a look on this https://docs.google.com/spreadsheets/d/1cMTCFptA9eQnMMoCJAbuhU4pQ14lzcgqENNNfs6uG6o/edit?usp=sharing ... I used "?" for missing entries, e.g. ?, ?, 1 means, that (possible) min and max values are not known, but the default value is "1".

3) Is there a difference between <not set>/null and false for some parameters, e.g.--crushvcf?

AdaptiveGenotyper

4) Is it possible to use more than one model file (--model/--models)?

5) The output format of "recal variants" (VCF or gVCF) depends on the input format of parameter --vcf?

6) Can you provide a small test file (VCF) that runs without errors? The test files in Pisces/src/test/AdaptiveGenotyper.Tests/TestData cause errors like

ERROR: No BAM files found in data
Stack trace:
   at CommandLine.Options.BamProcessorParsingUtils.UpdateBamPathsWithBamsFromFolder(String bamPathsString)
   at CommandLine.Options.BamProcessorParsingUtils.<>c__DisplayClass0_0.<GetBamProcessorParsingMethods>b__0(String value)
   at CommandLine.NDesk.Options.OptionSet.<>c__DisplayClass7_0.<Add>b__0(OptionValueCollection v)
   at CommandLine.NDesk.Options.OptionSet.ActionOption.OnParseComplete(OptionContext c)
   at CommandLine.NDesk.Options.Option.Invoke(OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.ParseValue(String option, OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.Parse(String argument, OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.Parse(IEnumerable`1 arguments)
   at CommandLine.Options.BaseOptionParser.DoGenericParsingAndValidation(String[] args, Dictionary`2 optionSetDics, Boolean includeValidationWithParsing)

or

ERROR: Unknown ploidy model ''diploid''

Stack trace:
   at CommandLine.Options.VariantCallingOptionsParserUtils.ConvertToPloidy(String value)
   at CommandLine.Options.VariantCallingOptionsParserUtils.<>c__DisplayClass0_0.<GetVariantCallingParsingMethods>b__11(String value)
   at CommandLine.NDesk.Options.OptionSet.<>c__DisplayClass7_0.<Add>b__0(OptionValueCollection v)
   at CommandLine.NDesk.Options.OptionSet.ActionOption.OnParseComplete(OptionContext c)
   at CommandLine.NDesk.Options.Option.Invoke(OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.ParseValue(String option, OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.Parse(String argument, OptionContext c)
   at CommandLine.NDesk.Options.OptionSet.Parse(IEnumerable`1 arguments)
   at CommandLine.Options.BaseOptionParser.DoGenericParsingAndValidation(String[] args, Dictionary`2 optionSetDics, Boolean includeValidationWithParsing)

GeminiMulti

7) Doesn't seem to work with samtools > 1.10.

8) There are some conditionals for the parameter --stitchonly and --realignonly. Is the following order correct?

9) What are possible values for --stringtagstokeepfromr1?

10) Do --categoriestorealign and --categoriestosnowball share the same set of possible values? And what are the default values of --categoriestosnowball?

Pisces

11) What happens if the number of files for --intervalpaths/-i and --bam differ?

12) Can you provide a test file for --forcedalleles, --priorspath and --adaptivegenotypeparameters_fromfile?

13) What is the effect of --insidesubprocess true?

14) What is the default value for --vffilter, --minvariantfrequencyfilter?

15) Are the following conditionals correct? And are parameters missing here (e.g. --minvq, ...)?

16) What is the effect of --debug true? I cannot find extra generated files that are mentioned in --help.

17) I think that the following entries of --help are wrong.

--maxmnvlength <FOLDER> FOLDER Max length phased SNPs that can be called
--trackedanchorsize <FLOAT> ...

ReformatVcf

18) *.uncrushed.vcf result file is not created in --outfolder but in the folder the input VCF file is located in.

19) This subtool creates a crushed VCF and takes an uncrushed VCF as input? What about gVCF here?

Scylla

20) Log file has the same name as the input file (*.vcf).

VariantQualityRecallibration

21) *.counts, *.edgevariants, *.edgecounts are not always created. On which parameter do they depend on?

22) In some tests an error message like the following appeared.

4/7/21 4:02 PM 1  Exception: System.IO.FileNotFoundException: Could not find file '/home/stephan/Projects/tools/pisces/test/vqr4/control.edgevariants'.
File name: '/home/stephan/Projects/tools/pisces/test/vqr4/control.edgevariants'
   at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
   at System.IO.FileStream..ctor(String path, FileMode mode)
   at VariantQualityRecalibration.VariantListReader.ReadVariantListFile(String file)
   at VariantQualityRecalibration.QualityRecalibration.GetRecalibrationTables(SignatureSorterResultFiles resultsFilePaths, VQROptions options)
   at VariantQualityRecalibration.QualityRecalibration.Recalibrate(SignatureSorterResultFiles countsFilePaths, VQROptions options).
4/7/21 4:02 PM 1  Work complete.
4/7/21 4:02 PM 1  ******************** Ending *********************

Type in --help at the description of parameter -z "averge".

CreateGenomeSize

23) Why is it important, that a genome description fulfills the following check GSFOptions.SpeciesName.Split(' ').Length < 3)?

Thank you in advance, Stephan

tamsen commented 3 years ago

Hi. Sorry, just saw this now. I'll try to work through this as I have time. I appreciate the effort you put into going through this.

Off the top of my head, test files you find in the src/test folders (Pisces/src/test/AdaptiveGenotyper.Tests/TestData) are not meant to be used as test input for the applications. They are test input for the unit tests. Some of the files are deliberately pathological to test error handling.

For the rest, I need to go through your questions one by one and check current limits / behaviors, etc.

Thanks

tamsen commented 3 years ago
  1. I'd limit your integration task to "work horse" applications like

GeminiMulti - takes bam. outputs bam Pisces - takes bam. outputs vcf Scylla - takes bams and vcf. Should work with all VCF, gVCF, uncrushed or crushed variant files as input and output. VariantQualityRecalibration - takes bams and vcf. Should work with all VCF, gVCF, uncrushed or crushed variant files as input and output.

The other applications (AdaptiveGenotyper, Psara, ReformatVcf, VennVcf) are very special purpose. I'm not sure how useful they would be to a general audience unless they felt like tinkering with the code.

tamsen commented 3 years ago
  1. Can you please grant me access to your spread sheet?
  2. Yes. All the options being applied (those explicitly given through the command line by the user and the defaults being applied) will be in the Options.used.json file. If the parameter value for an option is set to "null" then its not applicable to the run. Ie, if AmpliconBiasFilterThreshold" is null, then AmpliconBiasFilter is not going to be used. If "IsMale" is null, then no gender will be presumed for the sample (ie, no special treatment for sex chr).
  3. Punt for now. Sidney Kuo was the lead dev for that project and would be the best to ask. I can keep this on my TODO list.
  4. Punt for now. Sidney Kuo was the lead dev for that project and would be the best to ask. I can keep this on my TODO list.
  5. Punt for now. Sidney Kuo was the lead dev for that project and would be the best to ask. I can keep this on my TODO list.
  6. It works for me on samtools 1.11 . What kind of error are you getting? What version of samtools? If the samtools merge cmd syntax changed since, that would certainly cause a crash. There is a flag to switch syntax (see SamtoolsOldStyle)
  7. That looks right. Gwenn Berry was lead dev for Gemini and would be the best to ask.
  8. Its open-ended. By default, the newly-stitched read only has minimal bam tags imported from the original reads. Since anyone can decorate their bams with any new tags, its impossible to anticipate how the user wants them merged. We had some use case where certain special bam tags needed to be preserved in the final merged bam, so the option to bring them along from the r1 read had to be created. (the choice of r1 vs r2 was random)
  9. yes, and the default for CategoriesForSnowballing are empty.
  10. I believe the program will stop running and give you a nice error msg, but I will double check. There may be some presumption that if you give exactly one interval file and many bams, the same interval file will be applied to all of them..
tamsen commented 3 years ago

re 11. What happens if the number of files for --intervalpaths/-i and --bam differ?

Confirmed, you get a nice error msg. ERROR: Multiple IntervalPaths specified, but number does not correspond with number of BAMPaths.

re 12: Can you provide a test file for --forcedalleles, --priorspath and --adaptivegenotypeparameters_fromfile?

Forcedalles input is a path to a standard vcf . priorspath is also just a path to a standard vcf file. The input for adaptivegenotypeparameters_fromfile is a path to a "model file" and there is an example in AdaptiveGenotyper.Tests/TestData/example.model

re 13. I dont think you as a human user ever need to worry about this one. Its exposed because of a cheap trick we once did to parallelize the code. This was a flag the wrapper manipulated.

re 14. --vffilter, --minvariantfrequencyfilter defaults to 0.01f, ie 1%

re 15 I believe that is correct. "crushvcf" and "forcedalleles" dont work well together. Those two mode were meant to be obsoleted in the future (imo, neither is worth maintaining) so the corner case where both were on together was never worked through. The output vcf formating of the two modes run together would be very messy.

minvq|minvariantqscore Is the min quailty score a variant has to achieve in order to be emitted as a variant and not a "no call". By default is set to 20. The execeptional case is a variant that is a "forced allele," which will still be called, even if it below this threshold. These variants may still get tagged as filtered, if they dont make the filter thresholds.

re 16. The -debug option writes some extra lines to the log, but its not intended to be useful to anyone expect a developer. What 'extra stuff' got written to the log varies version to verison, depending on what we were working on at the time. If -debug and -OutputBiasFiles options are used together, you can get a lot of extra files written with information about strand bias and read count. That might be the extra files you are thinking about.

re 17 ha. Yes, you are correct. The data type stated in the help is clearly wrong.

re 18. OK. Thats not the best behavior. I'll fix it when/if I do a re-release. IMO, anything output should go to the output folder.

re 19. I believe it respects and preserves gvcf.

re 20.. OK, not the best log file name. I think the motivation was that Scylla.log caused a problem when multple vcfs and multple instances of scylla were running and writing to the same directory. So we wanted separate logs per vcf.

re 21 - You need to specify what kind of data you want to keep track of, to determine what output files to use. DoBasicChecks will just check for the number of different categories of SNPs, looking to see if something is over-represented and maybe indicating something going wrong the experiment. DoAmpliconPositionChecks does similar but only around a window kicked on by a jump/fall in coverage suggestive of the edge of an amplicon.

if (options.DoBasicChecks) { CountsFileWriter.WriteCountsFile(basicCountsPath, basicCountsData); }

if (options.DoAmpliconPositionChecks) { CountsFileWriter.WriteCountsFile(edgeCountsPath, edgeVariantsCountData); }

re 22 - Was this a test you ran your self? (if yes, what was your command line input?) Or was it an error thrown by one of our canned unit tests? For the error, my guess is that it was trying to perform recalibration, and it wanted a file with the count of variants found in the amplicon edges, to see if they were over represented (thus possibly should have their qscores downgraded by the recalibrator). This amplicon-edge-awareness feature in the recalbrator was one of the last features I was working on before i left the company and its really more of a stub that anything completely developed.

re 23 - that's just the limitations of the parser and nothing to do with science. The parser in the program was tightly coupled to a naming convention and file structure used at illumina. This could be lifted to make it more user friendly.

Thanks again for your efforts with this.