genfile sample names - Githubissues

GuyReeves commented 2 years ago

Hi

I am trying to get the genfile option to work. However I can not figure out what the sample names should be.

The Instruction are "File has a header row with a name for each sample, matching what is found in the bam file." genfile is tab delineated and has one more row than posfile (which has no header)

I have tried a sample names

the sample name in the sampleNames_file
the file name in the bamlist (without the path)
the SM from the .bam header
the RG from the .bam header

All give the same error

_[2022-07-02 11:20:29] Get and validate pos and gen Error in names(x) <- value : 'names' attribute [4] must be the same length as the vector [1] Calls: STITCH ... get_and_validate_pos_gen_and_phase -> get_and_validatepos -> colnames<- Execution halted I can send the head of each file if that helps.

Thanks

rwdavies commented 2 years ago

Hmm, it should be the SM from the .bam header

See for example https://github.com/rwdavies/STITCH/blob/8f6df131004bf2335701c001cd5a5710f05fde0d/STITCH/tests/testthat/test-unit-reads.R#L5

This first example uses the header below to get the sample name "jimmy"

@HD VN:1.5  SO:coordinate
@SQ SN:1    LN:45
@RG ID:7369_8x15    SM:jimmy
@PG ID:samtools PN:samtools VN:1.12 CL:samtools view -H /var/folders/yy/5xsy75511pg691vl8jkmr209vrqq7r/T//RtmprBYHdk/jimmy.bam

GuyReeves commented 2 years ago

so if for a single sample in the following files sampleNames_file = B106_10 bamlist =path/b106_10_BtSamOutputFixmate.sorted.dedup.q20.concordant.bam .bam header SM:573. of which I use "573" in the genfile

does the the incompatibility between the SM and the file name mean I cannot use this option without changing the file name so it is 573.bam?

or I can change the .bam header so SM:b106_10_BtSamOutputFixmate.sorted.dedup.q20.concordant ?

rwdavies commented 2 years ago

So the filename itself shouldn't make a difference That being said, I think there's a bug here The code works fine without the sampleNames_file However when supplied with the sampleNames_file, it ignores this new sample name, and tries to match genfile with the original sample names (the value in @RG SM tag)

So in your case, I think the easiest fix would be having your desired output name in the sampleNames_file, then matching the SM tag in the RG entry to the relevant column header in the genfile

I'll try to fix this soon

rwdavies commented 2 years ago

Sorry I spoke too soon, I wrote a test, and this seems to be fine?

So in your case

sampleNames_file = B106_10 bamlist =path/b106_10_BtSamOutputFixmate.sorted.dedup.q20.concordant.bam .bam header SM:573. of which I use "573" in the genfile

You can have whatever you want in the .bam header (573), then in the genfile, you would put whatever you had in the sampleNames_file so here B106_10 would need to match genfile

GuyReeves commented 2 years ago

HI

I have tried using B106_10

but I get the same error.

I have attached the small files for my test. The bamlist is not actually used as I am using , regenerateInput = FALSE, regenerateInputWithDefaultValues = TRUE.

test_genfile.zip

(https://github.com/rwdavies/STITCH/files/9039384/b106_10header.sam.txt) m.txt]

rwdavies commented 2 years ago

OK now I understand. That doesn't seem to work, because the sample RData and sample names have already been extracted. So I think your use case is not-standard, and I have changed the validation to throw an error in that situation

To get around it though, you can modify your code to do something like the following, which should work

https://github.com/rwdavies/STITCH/blob/6c71665399cd304b8dab2b0fe22aed253dac3e88/STITCH/tests/testthat/test-acceptance-sample-names.R#L86

GuyReeves commented 2 years ago

I would prefer not to have to change the sample name.

If you think it will work I would prefer to redo it all but with regenerateInput = TRUE -are you saying that this should work ?

rwdavies commented 2 years ago

I'm a little lost between your preferred workflow, and the workflow are interested in specifically for this case. Can you re-state them please?

GuyReeves commented 2 years ago

I usually use this command to generate the basic analysis a

STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "7257samplelist.txt", nCores = 1, bamlist = "7257_bamlist.txt", posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 8, nGen = 11, originalRegionName = "chr2L", regenerateInput = TRUE, expRate = 0.4, niterations = 1, splitReadIterations = FALSE, refillIterations = NA, shuffleHaplotypeIterations = NA, reference_haplotype_file = "ref_haplotype_NP_only_chr2L.txt" , reference_legend_file = "ref_legend_SNP_only_chr2L.txt.gz", outputInputInVCFFormat = TRUE)

I then use rsync to duplicate the data /folder to try varying parameters- though genefile is the very last thing I want to try before I stick with what I have and generate a final dataset- so I really don't need any changes made to program to fit my non-standard path, I am happy to adapt to what is there. Particually if I only need to add "regenerateInput = TRUE", and delete , "regenerateInputWithDefaultValues = TRUE" from the command below

STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "7257samplelist.txt", genfile = "b106_10_genfile_test.txt", posfile = "231604_posChr2L.txt", outputdir = paste0(getwd(), "/"), K = 8, nGen = 11, nCores = 10, originalRegionName = "chr2L", regenerateInput = FALSE, regenerateInputWithDefaultValues = TRUE, expRate = 0.4, niterations = 1, splitReadIterations = FALSE, refillIterations = NA, shuffleHaplotypeIterations = NA, reference_haplotype_file = "231604_ref_haplotype_chr2L.txt" , reference_legend_file = "231604_ref_legend_chr2L.txt.gz")

Thanks for all the help

GuyReeves commented 2 years ago

To be honest I am not great at coding and I would not know were to start "you can modify your code to do something like the following, which should work"

rwdavies commented 2 years ago

Sorry for the slow reply, I was off sick for a bit. Also re: your last comment, I'll try to make sure my answers are clear. That comment was about showing that the genfile header should match (in your case) the sample name as specified in the sampleNames_file.

To be clear, genfile doesn't affect imputation performance, it allows you to get an estimate of imputation performance as the program runs (e.g. to see, on the fly, if parameter changes are helpful, or how many iterations you might need to get accuracy you feel is sufficient). So if you're at the stage of your analysis where you're trying to optimize parameters, you don't need to optimize inclusion or removal of genfile

Now, about the code you sent above. Why are you using outputInputInVCFFormat = TRUE, do you want a VCF of the reads available, or do you just want the RData format that STITCH uses? If the later, then you can use generateInputOnly = TRUE, which is slightly faster, but programatically cleaner (you're doing what you want to do directly, not indirectly)

As a general comment, I agree with rsyncing the data sample RData files to new folders to try new parameter values, that is a major reason that generateInputOnly exists as an option. However, I am a bit worried about the niterations = 1 value in the second block of code? Given you don't have a reference panel, you'll basically be imputing noise, because the EM algorithm won't have had a chance to run. You might also want to impute smaller regions, that is a large (~23 Mbp) region, and you have a lot of samples (~7200 I think), you might be better off running smaller regions e.g. 5 Mbp with 500 kbp buffers

GuyReeves commented 2 years ago

No worries, I really appreciate your enormous patience with me.

I had missed the that the genfile does not improve performance, consequently I will drop it ( the last thing I did was edit the .bam header using picard so the sample_name = RG:SM, I got the same validation error).

as far as I was following some excellent advice you gave before, though I am under th eimptression that in all my examples aboive I was using a reference panel -reference_haplotype_file = "231604_ref_haplotype_chr2L.txt" , https://github.com/rwdavies/STITCH/issues/62

If you think your inferred founder haplotypes contain no phase switch errors, you can either set niterations to 1 (to impute using only information from the reference haplotypes), or generally to something low like 5 or 10 (which would use the founder haplotypes to initialize, then proceed normally) (you'd probably want to turn off shuffleHaplotypeIterations here). If you think your inferred founder haplotypes might contain phase switch errors, you can set niterations to something like 20, then keep a few shuffleHaplotypeIterations e.g. 4, 8, 12. shuffleHaplotypeIterations looks for places where breaking and re-setting ancestral haplotypes improves the likelihood, i.e. here phase switch errors among founders. I would recommend disabling splitReadIterations and refillIterations if you go with reference haplotypes, especially if you're reasonably sure you have all the founder haplotypes, as they won't be necessary.

I had forgotten about trying the a low number of iterations and turning off shuffleHaplotypeIterations . Almost half of the dataset (which has a high degree of heterozygosity) is part of trio so we can use mendelian error rate to asses quality. It is very good.
I will try >1 iteration and shuffleHaplotypeIterations = c(4, 8, 12, 16), and see if the 200 individuals with >1% error decreases. Thanks

GuyReeves commented 2 years ago

"You might also want to impute smaller regions, that is a large (~23 Mbp) region, and you have a lot of samples (~7200 I think), you might be better off running smaller regions e.g. 5 Mbp with 500 kbp buffers"

will this lead to better results or is this just to reduce computing resources ?

rwdavies commented 2 years ago

Oh OK I see above re: haplotype and legend file, my apologies I missed that

About imputing in chunks, that's just for computational reasons, it shouldn't lead to more accurate imputation

rwdavies / STITCH

genfile sample names #68