PacBio Sequel II data - Githubissues

Malabady commented 4 years ago

Hi, I am trying to use genomescope to estimate heterozygosity and genome size of a fairly large plant genome, ~ 3.65Gbp. I am using the histogram of canu-corrected reads. Here is what I am getting:

GenomeScope version 1.0 k = 22

property min max
Heterozygosity 0.775102% 0.795864%
Genome Haploid Length NA bp NA bp
Genome Repeat Length NA bp NA bp
Genome Unique Length 974,154,308 bp 977,407,537 bp
Model Fit 86.4268% 93.3631%
Read Error Rate NA% NA%

I am not sure why it is not estimating the genome size? Also, given that pacbio reads are of variable lengths, what would be the ideal read length to give to genomescope?

Many thanks,

mschatz commented 4 years ago

We have had mixed results using GenomeScope with error corrected long reads as the error correction can distort heterozygous kmers and other properties. But it is odd that it is estimating the genome size to be ~1GB but you expect ~3.6. Can you send the link to the profile? Maybe something was confused with the model fitting routine.

Cheers

Mike

Malabady commented 4 years ago

Hi Mike,

Here is the link: http://genomescope.org/analysis.php?code=example2

In your tests with corrected long reads, what read length did you use?

Thanks Magdy

mschatz commented 4 years ago

Hi Magdy,

That is one of the same projects. Can you double check and send a link with a code number, sort of like this one: http://qb.cshl.edu/genomescope/analysis.php?code=fAGB1i4BmhEBAP9lDXaP

Alternatively, can you send the kmer histogram file?

And it turns out the read length isnt actually used in the final output. It very slightly helps initialize the model fitting, but any value will get to the same result. We are actually removing it as a parameter from the next version of genomescope.

Thanks

Mike

Malabady commented 4 years ago

Sorry, here is it: http://qb.cshl.edu/genomescope/analysis.php?code=wuSFuJbgWbECTETlCiXR

and the histogram file run.ms22.histo.txt

mschatz commented 4 years ago

Hi Magdy,

I ran the histogram through our new version of GenomeScope and the results look a lot better: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=xuGAx8mhFBqxnRUV2z68

This shows the genome size to be about 3.4Gb and the het rate to be about 0.8%. This is probably slightly underestimated since error correcting long reads can overcorrect what are real heterozygous bases, but should be fairly accurate. In addition to using a better optimization function, the new version considers all kmers so it is also able to pick up on high copy repeats better. We actually posted a preprint about the updated model this morning to the bioRxiv and will be available shortly that describes all the improved methods. If you want to process any other datasets, the website is available here: http://genomescope.org/genomescope2.0

Cheers

Mike

On Mon, Aug 26, 2019 at 9:54 AM Malabady notifications@github.com wrote:

Sorry, here is it: http://qb.cshl.edu/genomescope/analysis.php?code=wuSFuJbgWbECTETlCiXR

and the histogram file run.ms22.histo.txt https://github.com/schatzlab/genomescope/files/3541438/run.ms22.histo.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP346RAEMATEDU3NEA6GDQGPOANA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5EOBYY#issuecomment-524869859, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP342RROKL5UE7OUE4JWTQGPOANANCNFSM4IPGUJ6A .

Malabady commented 4 years ago

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

mschatz commented 4 years ago

The preprint is currently being screened, but should be available later today or tomorrow. Yes, there are four plots now because we do a mathematically transformation to the data that helps with highly heterozygous samples. But the results will be identical on the transformed and original data (we always fit to the transformed data, and just display the results twice for visualization purposes)

Good luck!

Mike

On Mon, Aug 26, 2019 at 12:24 PM Malabady notifications@github.com wrote:

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP345JHNGALX6B2MAJ2ZDQGP7TZA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E4FFY#issuecomment-524927639, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP347BWHLCCPFSGZ4DK5DQGP7TZANCNFSM4IPGUJ6A .

mschatz commented 4 years ago

FYI - The preprint is now available on the biorxiv: https://www.biorxiv.org/content/10.1101/747568v1

Cheers

Mike

On Mon, Aug 26, 2019 at 12:34 PM Michael Schatz michael.schatz@gmail.com wrote:

The preprint is currently being screened, but should be available later today or tomorrow. Yes, there are four plots now because we do a mathematically transformation to the data that helps with highly heterozygous samples. But the results will be identical on the transformed and original data (we always fit to the transformed data, and just display the results twice for visualization purposes)

Good luck!

Mike

On Mon, Aug 26, 2019 at 12:24 PM Malabady notifications@github.com wrote:

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP345JHNGALX6B2MAJ2ZDQGP7TZA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E4FFY#issuecomment-524927639, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP347BWHLCCPFSGZ4DK5DQGP7TZANCNFSM4IPGUJ6A .

Malabady commented 4 years ago

Many thanks, Mike. And congrats on the paper. Best! Magdy

From: Michael Schatz notifications@github.com Reply-To: schatzlab/genomescope reply@reply.github.com Date: Monday, August 26, 2019 at 4:35 PM To: schatzlab/genomescope genomescope@noreply.github.com Cc: Magdy Alabady malabady@uga.edu, Author author@noreply.github.com Subject: Re: [schatzlab/genomescope] PacBio Sequel II data (#24)

[External Sender] FYI - The preprint is now available on the biorxiv: https://www.biorxiv.org/content/10.1101/747568v1

Cheers

Mike

On Mon, Aug 26, 2019 at 12:34 PM Michael Schatz michael.schatz@gmail.com wrote:

The preprint is currently being screened, but should be available later today or tomorrow. Yes, there are four plots now because we do a mathematically transformation to the data that helps with highly heterozygous samples. But the results will be identical on the transformed and original data (we always fit to the transformed data, and just display the results twice for visualization purposes)

Good luck!

Mike

On Mon, Aug 26, 2019 at 12:24 PM Malabady notifications@github.com wrote:

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP345JHNGALX6B2MAJ2ZDQGP7TZA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E4FFY#issuecomment-524927639, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP347BWHLCCPFSGZ4DK5DQGP7TZANCNFSM4IPGUJ6A .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=ACHNEMB64H545M6N3NY4EFLQGQ5B7A5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5FTDCI#issuecomment-525021577, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACHNEMDVGLKRGEJSATTCBPTQGQ5B7ANCNFSM4IPGUJ6A.

Malabady commented 4 years ago

Hi Michael,

In the model fir stats, what does it mean when there is big gap between the min (35%) and max (96%)? Does it mean that ~ 35% of the data fits the model as good as a 96%? Also, is the read error rate accurate?

Many thanks for your help. Magdy

From: Michael Schatz notifications@github.com Reply-To: schatzlab/genomescope reply@reply.github.com Date: Monday, August 26, 2019 at 4:35 PM To: schatzlab/genomescope genomescope@noreply.github.com Cc: Magdy Alabady malabady@uga.edu, Author author@noreply.github.com Subject: Re: [schatzlab/genomescope] PacBio Sequel II data (#24)

[External Sender] FYI - The preprint is now available on the biorxiv: https://www.biorxiv.org/content/10.1101/747568v1

Cheers

Mike

On Mon, Aug 26, 2019 at 12:34 PM Michael Schatz michael.schatz@gmail.com wrote:

The preprint is currently being screened, but should be available later today or tomorrow. Yes, there are four plots now because we do a mathematically transformation to the data that helps with highly heterozygous samples. But the results will be identical on the transformed and original data (we always fit to the transformed data, and just display the results twice for visualization purposes)

Good luck!

Mike

On Mon, Aug 26, 2019 at 12:24 PM Malabady notifications@github.com wrote:

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP345JHNGALX6B2MAJ2ZDQGP7TZA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E4FFY#issuecomment-524927639, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP347BWHLCCPFSGZ4DK5DQGP7TZANCNFSM4IPGUJ6A .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=ACHNEMB64H545M6N3NY4EFLQGQ5B7A5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5FTDCI#issuecomment-525021577, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACHNEMDVGLKRGEJSATTCBPTQGQ5B7ANCNFSM4IPGUJ6A.

mschatz commented 4 years ago

Those are pretty typical results -- it is a weighted result and only considers part of the distribution (it is really just for debugging). For example see this nice fit in wheat: http://qb.cshl.edu/genomescope/genomescope2.0/analysis.php?code=example11

The error correction looks generally successful. The error peak (kmers with frequency near 1) is broader than you would get with high quality Illumina data but is good for error corrected long reads.

Good luck

Mike

On Thu, Aug 29, 2019 at 9:04 AM Malabady notifications@github.com wrote:

Hi Michael,

In the model fir stats, what does it mean when there is big gap between the min (35%) and max (96%)? Does it mean that ~ 35% of the data fits the model as good as a 96%? Also, is the read error rate accurate?

Many thanks for your help. Magdy

From: Michael Schatz notifications@github.com Reply-To: schatzlab/genomescope reply@reply.github.com Date: Monday, August 26, 2019 at 4:35 PM To: schatzlab/genomescope genomescope@noreply.github.com Cc: Magdy Alabady malabady@uga.edu, Author author@noreply.github.com Subject: Re: [schatzlab/genomescope] PacBio Sequel II data (#24)

[External Sender] FYI - The preprint is now available on the biorxiv: https://www.biorxiv.org/content/10.1101/747568v1

Cheers

Mike

On Mon, Aug 26, 2019 at 12:34 PM Michael Schatz michael.schatz@gmail.com

wrote:

The preprint is currently being screened, but should be available later today or tomorrow. Yes, there are four plots now because we do a mathematically transformation to the data that helps with highly heterozygous samples. But the results will be identical on the transformed and original data (we always fit to the transformed data, and just display the results twice for visualization purposes)

Good luck!

Mike

On Mon, Aug 26, 2019 at 12:24 PM Malabady notifications@github.com wrote:

Thanks a lot Mike! this is looking much better & is within the expectation for sure. Do you have a link to the preprint? I need to read about the interpretation of the new plots (coverageFrequency vs coverage). I see that results are the same in both profiles (Frequency Vs Coverage & FrequencyCoverage Vs Coverage). Is this to be expected?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP345JHNGALX6B2MAJ2ZDQGP7TZA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5E4FFY#issuecomment-524927639>,

or mute the thread < https://github.com/notifications/unsubscribe-auth/AABP347BWHLCCPFSGZ4DK5DQGP7TZANCNFSM4IPGUJ6A>

.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub< https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=ACHNEMB64H545M6N3NY4EFLQGQ5B7A5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5FTDCI#issuecomment-525021577>, or mute the thread< https://github.com/notifications/unsubscribe-auth/ACHNEMDVGLKRGEJSATTCBPTQGQ5B7ANCNFSM4IPGUJ6A>.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/schatzlab/genomescope/issues/24?email_source=notifications&email_token=AABP342VKTYVUAY3JBHWLW3QG7CPTA5CNFSM4IPGUJ6KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5OM3RY#issuecomment-526175687, or mute the thread https://github.com/notifications/unsubscribe-auth/AABP342EHAD4B73APPYOJSTQG7CPTANCNFSM4IPGUJ6A .

schatzlab / genomescope

PacBio Sequel II data #24