rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

impute on set of WGBS bams + reference 1000G bams #18

Closed avilella closed 1 year ago

avilella commented 5 years ago

Hi,

We want to test an imputation method on a set of low-coverage WGBS human bams, which, being both bisulfite-treated and low-coverage, only contain a subset of callable SNVs (e.g. many C->T SNVs are lost in low-coverage WGBS).

We would like to try STITCH by combining a set of these low-cov WGBS bams with a set of 1000G bams, which hopefully will act as a pseudo-reference panel for the WGBS samples.

Do you have any recommendations on running STITCH and/or the parameters to use for this task?

Thanks in advance, twitter @albertvilella

rwdavies commented 5 years ago

Hey, sorry for my slow reply, I was moving house on Wednesday and caught up in move-related activities on the surrounding days. Two options come to mind:

1) Use a reference panel with 1 iteration (i.e. no updating), something like this: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R#L432 This is probably the easiest and definitely the fastest. You can use either all reference sample populations or only those that are most closely related, if you're imputing a relatively homogeneous population (e.g. Europeans or East Asians).

2) Use all BAMs together and impute. For all samples, but specifically affecting the 1000G samples, you could set downsampleToCov to something rather low, like 1 or 4, as coverage over 4X won't really affect the analysis but will slow things down. The same comment about using closely related 1000G samples applies.

Otherwise, something like method = "diploid" and K of 20 or 40 is worth trying, depending on speed, RAM, etc.

Hope that helps? Definitely take care over which SNVs to include, both for imputing and for benchmarking. In previous work with very short reads (~35bp), I've seen overall accuracy at all SNVs decrease when the imputation included SNVs where having either the ref or alt at the SNV could allow the read to map. Hopefully your C->T SNV removal will only be very weakly related to haplotype age, so it won't affect things unduly.

Best, Robbie
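The C->T SNV removal discussed above could be sketched roughly as follows. This is a minimal illustration, not part of STITCH: the function names are hypothetical, and treating G/A sites (the reverse-strand equivalent) and both ref/alt directions as ambiguous is an assumption that may not match every WGBS library type or caller.

```python
# Hypothetical helper, not part of STITCH.
# Assumption: C/T SNVs (and G/A on the opposite strand) cannot be told apart
# from bisulfite conversion, regardless of which allele is the reference.
BISULFITE_AMBIGUOUS = {("C", "T"), ("T", "C"), ("G", "A"), ("A", "G")}


def is_bisulfite_ambiguous(ref, alt):
    """Return True if the ref/alt pair could be mimicked by bisulfite conversion."""
    return (ref.upper(), alt.upper()) in BISULFITE_AMBIGUOUS


def drop_ambiguous(sites):
    """Keep only sites whose alleles are not confounded by bisulfite conversion.

    sites: iterable of (chrom, pos, ref, alt) tuples.
    """
    return [s for s in sites if not is_bisulfite_ambiguous(s[2], s[3])]


# Made-up example sites, for illustration only.
sites = [
    ("chr20", 100, "C", "T"),
    ("chr20", 200, "A", "C"),
    ("chr20", 300, "G", "A"),
    ("chr20", 400, "T", "G"),
]
kept = drop_ambiguous(sites)
```

Applying the same filter to both the imputation site list and the benchmarking site list keeps the two comparable.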

avilella commented 5 years ago

Thanks for the detailed response: it was really helpful to narrow down what I could try with my data and STITCH.

I had a look at this and I think I have it sorted out so that I can start to try option 1, except that I still need to decide on the contents of posfile (pos.txt). What would be a good starting point for a whole-genome human set of SNVs to impute? Is there a default one that people use on hg38 that is conservative enough as a starting point?

My starting point for test 1 is a handful of low-coverage bams and the equivalent 30x WGS for the same samples. This means we more or less know the truth for SNV calling, having done the expensive high-coverage experiment for these samples. Later on, this would run only on the low-coverage bams, without the equivalent 30x WGS bams from the same individuals.

All these samples are of EUR ancestry (based on an estimate from the iAdmix calculateGLL ancestry tool https://github.com/vibansal/ancestry).

Long story short, where could I get an hg38 posfile that would be a good starting point to attempt to run this?

Thanks in advance.
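For the benchmarking setup described above (low-coverage bams with matched 30x WGS as truth), one common way to score imputation is the squared Pearson correlation between imputed dosages and truth genotypes. The sketch below assumes that metric and that both inputs are already aligned per sample at one SNV; the function name and input layout are hypothetical and not taken from STITCH.

```python
def dosage_r2(truth, dosage):
    """Squared Pearson correlation between truth genotypes and imputed dosages.

    truth:  list of 0/1/2 alt-allele counts from the high-coverage calls.
    dosage: list of imputed alt-allele dosages (0.0 to 2.0), same sample order.
    Returns NaN when either vector has no variance (r2 is undefined there).
    """
    n = len(truth)
    if n != len(dosage) or n < 2:
        raise ValueError("truth and dosage must have equal length >= 2")
    mean_t = sum(truth) / n
    mean_d = sum(dosage) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosage))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return float("nan")
    return (cov * cov) / (var_t * var_d)


# Made-up values for four samples at one SNV, for illustration only.
perfect = dosage_r2([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0])
noisy = dosage_r2([0, 1, 2], [0.1, 0.9, 1.8])
```

With only a handful of samples, per-SNV values are noisy, so it may be more informative to pool sites (for example by allele frequency bin) before computing the correlation.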

rwdavies commented 5 years ago

Hey, sorry for the slow reply again, I was on holiday at a cottage in rural Quebec without internet, and then moving house again. Done with holidays / moving for now!

Since they are all European, a good starting point might be 1000 Genomes bi-allelic SNPs with EUR MAF > 1%. Depending on how low your coverage goes, you might want to use MAF > 5% instead (e.g. if coverage is <0.1X).
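The site selection suggested above could be sketched like this. It assumes the 1000 Genomes records have already been parsed into (chrom, pos, ref, alt, eur_af) tuples, where eur_af is the EUR alt-allele frequency; how that frequency is extracted depends on the release you download and is not shown. The function name is hypothetical. The output follows the STITCH posfile layout: tab-separated, no header, with columns chromosome, position, ref, alt.

```python
def build_posfile_lines(records, min_maf=0.01):
    """Select bi-allelic SNPs above a EUR MAF threshold as posfile lines.

    records: iterable of (chrom, pos, ref, alt, eur_af), sorted by position.
    min_maf: 0.01 for MAF > 1%, or 0.05 for very low coverage.
    """
    bases = {"A", "C", "G", "T"}
    records = list(records)
    # Count rows per position so that multi-allelic sites which were split
    # over several rows are dropped entirely.
    counts = {}
    for chrom, pos, *_ in records:
        counts[(chrom, pos)] = counts.get((chrom, pos), 0) + 1
    lines = []
    for chrom, pos, ref, alt, eur_af in records:
        if counts[(chrom, pos)] != 1:
            continue
        if ref not in bases or alt not in bases:
            continue  # drops indels and comma-separated multi-allelic ALTs
        maf = min(eur_af, 1.0 - eur_af)
        if maf > min_maf:
            lines.append(f"{chrom}\t{pos}\t{ref}\t{alt}")
    return lines


# Made-up records, for illustration only.
records = [
    ("chr20", 1000, "A", "G", 0.25),   # kept
    ("chr20", 1500, "C", "T", 0.004),  # too rare
    ("chr20", 2000, "G", "GA", 0.30),  # indel
    ("chr20", 2500, "T", "C", 0.995),  # minor allele too rare
    ("chr20", 3000, "A", "C", 0.10),   # multi-allelic, split over two rows
    ("chr20", 3000, "A", "T", 0.05),
    ("chr20", 3500, "G", "C", 0.97),   # kept, MAF = 0.03
]
lines = build_posfile_lines(records)
```

Writing `"\n".join(lines)` to pos.txt, one chromosome at a time, would give a file in the posfile layout; raising min_maf to 0.05 gives the stricter set mentioned for very low coverage.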

avilella commented 5 years ago

Thanks, that sounds like a good plan.
