odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

How are the chunking files made? Can I chunk by samples? #30

Open BEFH opened 1 year ago

BEFH commented 1 year ago

I need to chunk some large VCF files, so I took a look at the chunking files at resources/chunks/b38/20cM/chunks_chr*.txt, but they don't make sense to me. It looks like column 2 holds overlapping chunks and column 3 holds non-overlapping ones, but the chunks don't appear to be 20 cM long. They seem closer to 37 cM based on the map files provided.

Also, is there any reason not to chunk based on groups of samples if I have to do that type of chunking downstream?

odelaneau commented 1 year ago

Please do not chunk by groups of samples, this will reduce accuracy.

Instead, use the regions provided in resources. I've just added a README file in resources/chunks/b38/ describing the content of the files.

Hope this helps.

BEFH commented 1 year ago

The readme is definitely useful, but do you have an answer for the other part of the question?

> the chunks don't appear to be 20 cM. They seem closer to 37 based on the map files provided.

Thanks.

Also, you mention that chunking by sample will reduce accuracy. That's because of the IBD-based algorithm, right? So will it reduce accuracy for rare variants only?

Edit: the 37 cM number is for chromosome 22. On chromosome 17 it looks closer to 41.4 cM for the first chunk, though it is 23ish cM for the second chunk and 21 cM for the third. This is all very confusing to me.
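For readers who want to reproduce this kind of measurement, here is a minimal sketch of how a chunk's genetic length can be estimated from a genetic map by linear interpolation. It assumes the map has already been loaded into two parallel lists (base-pair positions in strictly increasing order, and cumulative cM values) and that a chunk is written as a region string such as `chr22:START-END`. The function names are hypothetical and are not part of SHAPEIT5.

```python
from bisect import bisect_left


def parse_region(region):
    """Split a region string like 'chr22:150-300' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def interpolate_cm(positions, cms, pos):
    """Linearly interpolate the cumulative cM value at a base-pair position.

    positions must be strictly increasing; positions outside the map are
    clamped to the first or last map entry (an assumption of this sketch).
    """
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect_left(positions, pos)  # positions[i-1] < pos <= positions[i]
    p0, p1 = positions[i - 1], positions[i]
    c0, c1 = cms[i - 1], cms[i]
    return c0 + (c1 - c0) * (pos - p0) / (p1 - p0)


def region_length_cm(region, positions, cms):
    """Genetic length of a region: cM at its end minus cM at its start."""
    _, start, end = parse_region(region)
    return interpolate_cm(positions, cms, end) - interpolate_cm(positions, cms, start)
```

Called on the buffered region and on the unbuffered region of the same chunk, this gives two different lengths, so it matters which column of the chunk file is being measured.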

odelaneau commented 1 year ago

> Also, you mention that chunking by sample will reduce accuracy. That's because of the IBD based algorithm, right? So it will reduce accuracy for rare variants only?

Common variants will also be affected (to a lesser extent, though).

> the 37 cM number is for chromosome 22. It looks closer to 41.4 cM on chromosome 17 for the first chunk, though it is 23ish for the second chunk and 21 cM for the third. This is all very confusing for me.

This chunking was generated in two steps. First, we ran the GLIMPSE_chunk algorithm. Second, we manually merged the chunks with poor phasing/imputation accuracy with neighboring ones. Does that make sense?
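The second step explains why some chunks end up much longer than 20 cM. The sketch below illustrates the idea only: the merging described above was done manually, and the input format, the `flagged` marker, and the rule of merging into the left neighbor are all assumptions made for this example. It does not run GLIMPSE_chunk; it takes an already computed list of chunks.

```python
def merge_flagged_chunks(chunks):
    """Merge flagged chunks into a neighboring chunk.

    chunks: list of (start, end, flagged) tuples sorted by start, where
    flagged is a hypothetical marker for poor phasing/imputation accuracy.
    A flagged chunk is merged into its left neighbor when one exists,
    otherwise into the next chunk. Returns a list of (start, end) tuples.
    """
    merged = []
    pending_start = None  # start of a leading flagged chunk awaiting a right neighbor
    pending_end = None
    for start, end, flagged in chunks:
        if pending_start is not None:
            # Absorb the waiting flagged chunk by extending this one leftwards.
            start = pending_start
            pending_start = None
        if flagged:
            if merged:
                # Extend the previous chunk to cover this one.
                prev_start, _ = merged[-1]
                merged[-1] = (prev_start, end)
            else:
                # No left neighbor yet: hold this chunk for the next one.
                pending_start, pending_end = start, end
        else:
            merged.append((start, end))
    if pending_start is not None:
        # Every chunk was flagged: emit them as a single chunk.
        merged.append((pending_start, pending_end))
    return merged
```

With four equally sized chunks where the first and last are flagged, the result is two chunks of double length, which matches the pattern of a few oversized chunks next to ones near the nominal size.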

BEFH commented 1 year ago

Yes, thanks for all your help. If you could provide example code so I could implement it for b37, that would be great. Otherwise, I appreciate your help!
