stschiff / sequenceTools

Other
40 stars 10 forks source link

Invalid option `-p' #23

Closed npsonis closed 2 years ago

npsonis commented 2 years ago

Hi again,

I am running pileupCaller with -p POP enabled but I get the following error

Invalid option `-p' Did you mean one of these? -h -d -f -e

The same if I use: --plinkOut POP

It works with -e. If neither is used I get: Missing: (-e|--eigenstratOut )

Perhaps -p is not implemented yet?

Version: 1.5.1

stschiff commented 2 years ago

Hmm, that's odd. Are you sure you're using the option correctly? It expects a filename as argument, like -p my_plink_file_prefix. You can check the online help using pileupCaller --help. Have I perhaps made a mistake in the documentation somewhere?

npsonis commented 2 years ago

Yes my filename is POP. It works with -e POP.

stschiff commented 2 years ago

Could you please post here the output of pileupCaller --help?

npsonis commented 2 years ago

Usage: pileupCaller [--version] (--randomHaploid | --majorityCall [--downSampling] | --randomDiploid) [--keepIncongruentReads] [--seed ] [-d|--minDepth ] [--skipTransitions | --transitionsMissing | --singleStrandMode] (-f|--snpFile ) [(-e|--eigenstratOut ) [--samplePopName POP] | (-p|--plinkOut ) [--samplePopName POP]] (--sampleNames NAME1,NAME2,... | --sampleNameFile ) PileupCaller is a tool to create genotype calls from bam files using read-sampling methods. To use this tool, you need to convert bam files into the mpileup-format, specified at http://www.htslib.org/doc/samtools.html (under "mpileup"). The recommended command line to create a multi-sample mpileup file to be processed with pileupCaller is

  samtools mpileup -B -q30 -Q30 -l <BED_FILE> -R -f <FASTA_REFERENCE_FILE>
      Sample1.bam Sample2.bam Sample3.bam | pileupCaller ...

You can lookup what these options do in the samtools documentation. Note that flag -B in samtools is very important to reduce reference bias in low coverage data.

This tool is part of sequenceTools version 1.5.1

Available options: --version Print version and exit -h,--help Show this help text --randomHaploid This method samples one read at random at each site, and uses the allele on that read as the one for the actual genotype. This results in a haploid call --majorityCall Pick the allele supported by the most reads at a site. If an equal numbers of alleles fulfil this, pick one at random. This results in a haploid call. See --downSampling for best practices for calling rare variants --downSampling When this switch is given, the MajorityCalling mode will downsample from the total number of reads a number of reads (without replacement) equal to the --minDepth given. This mitigates reference bias in the MajorityCalling model, which increases with higher coverage. The recommendation for rare-allele calling is --majorityCall --downsampling --minDepth 3 --randomDiploid Sample two reads at random (without replacement) at each site and represent the individual by a diploid genotype constructed from those two random picks. This will always assign missing data to positions where only one read is present, even if minDepth=1. The main use case for this option is for estimating mean heterozygosity across sites. --keepIncongruentReads By default, pileupCaller now removes reads with tri-allelic alleles that are neither of the two alleles specified in the SNP file. To keep those reads for sampling, set this flag. With this option given, if the sampled read has a tri-allelic allele that is neither of the two given alleles in the SNP file, a missing genotype is generated. IMPORTANT NOTE: The default behaviour has changed in pileupCaller version 1.4.0. If you want to emulate the previous behaviour, use this flag. I recommend now to NOT set this flag and use the new behaviour. --seed random seed used for the random number generator. If not given, use system clock to seed the random number generator. -d,--minDepth specify the minimum depth for a call. For sites with fewer reads than this number, declare Missing (default: 1) --skipTransitions skip transition SNPs entirely in the output, resulting in a dataset with fewer sites. --transitionsMissing mark transitions as missing in the output, but do output the sites. --singleStrandMode [THIS IS CURRENTLY AN EXPERIMENTAL FEATURE]. At C/T polymorphisms, ignore reads aligning to the forward strand. At G/A polymorphisms, ignore reads aligning to the reverse strand. This should remove post-mortem damage in ancient DNA libraries prepared with the non-UDG single-stranded protocol. -f,--snpFile an Eigenstrat-formatted SNP list file for the positions and alleles to call. All positions in the SNP file will be output, adding missing data where there is no data. Note that pileupCaller automatically checks whether alleles in the SNP file are flipped with respect to the human reference, and in those cases flips the genotypes accordingly. But it assumes that the strand-orientation of the SNPs given in the SNP list is the one in the reference genome used in the BAM file underlying the pileup input. Note that both the SNP file and the incoming pileup data have to be ordered by chromosome and position, and this is checked. The chromosome order in humans is 1-22,X,Y,MT. Chromosome can generally begin with "chr". In case of non-human data with different chromosome names, you should convert all names to numbers. They will always considered to be numerically ordered, even beyond 22. Finally, I note that for internally, X is converted to 23, Y to 24 and MT to 90. This is the most widely used encoding in Eigenstrat databases for human data, so using a SNP file with that encoding will automatically be correctly aligned to pileup data with actual chromosome names X, Y and MT (or chrX, chrY and chrMT, respectively). -e,--eigenstratOut Set Eigenstrat as output format. Specify the filenames for the EigenStrat SNP, IND and GENO file outputs: .snp, .ind and

.geno. If not set, output will be FreqSum (Default). Note that freqSum format, described at https://rarecoal-docs.readthedocs.io/en/latest/rarecoal-tools.html#vcf2freqsum, is useful for testing your pipeline, since it's output to standard out --samplePopName POP specify the population name of the samples, which is included in the output *.ind.txt file in Eigenstrat output. This will be ignored if the output format is not Eigenstrat (default: "Unknown") -p,--plinkOut Set Plink as output format. Specify the filenames for the Plink BIM, FAM and BED file outputs: .bim, .fam and .bed. If not set, output will be FreqSum (Default). Note that freqSum format, described at https://rarecoal-docs.readthedocs.io/en/latest/rarecoal-tools.html#vcf2freqsum, is useful for testing your pipeline, since it's output to standard out --samplePopName POP specify the population name of the samples, which is included in the output *.ind.txt file in Eigenstrat output. This will be ignored if the output format is not Eigenstrat (default: "Unknown") --sampleNames NAME1,NAME2,... give the names of the samples as comma-separated list (no spaces) --sampleNameFile give the names of the samples in a file with one name per line My command that works is the folowing (does not work with -p instead of -e): pileupCaller --randomHaploid --sampleNameFile Kinship_IDs.txt --samplePopName POP -e POP -f 1240K.snp < 1240K.pileup
stschiff commented 2 years ago

OK, the issue is that the -e flag means --eigenstratOut, while the -p flag means --plinkOut. You can't have both of these at the same time. When you try without the -e POP flag this command above should work.

The following command should work with the testData provided in the repository:

cd test/testDat
pileupCaller --sampleNames 12880A,12881A,12883A,12885A --randomHaploid --singleStrandMode -f 1240k_eigenstrat_snp_short.snp.txt -p testOut < AncientBritish.short.pileup.txt
npsonis commented 2 years ago

Sorry for reopened the thread, but the problem remains. This works: pileupCaller --randomHaploid --sampleNameFile Kinship_IDs.txt --samplePopName POP -e POP -f 1240K.snp < 1240K.pileup This doesn't: pileupCaller --randomHaploid --sampleNameFile Kinship_IDs.txt --samplePopName POP -p POP -f 1240K.snp < 1240K.pileup

I copied the command and just changed the letter e to p, to make sure that no typing error exists.

The error remains the same as I wrote previously: Invalid option `-p' Did you mean one of these? -h -d -f -e

However, this does works properly, as you propose: cd test/testDat pileupCaller --sampleNames 12880A,12881A,12883A,12885A --randomHaploid --singleStrandMode -f 1240k_eigenstrat_snp_short.snp.txt -p testOut < AncientBritish.short.pileup.txt

EDIT!!! I just figure out that if I change the order of the arguments it works: pileupCaller --randomHaploid --sampleNameFile Kinship_IDs.txt -p POP -f 1240K.snp --samplePopName POP < 1240K.pileup

So, the problem is with --samplePopName preceding -p.

stschiff commented 2 years ago

Yes, I understand now. Indeed, there is an implied order in these arguments, because --samplePopName makes only sense with either -p or -e set. I agree this is a bit weird and intransparent. I can easily fix that. I reopen this issue to remind myself.

stschiff commented 2 years ago

This is fixed in v1.5.2