open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

Fragment-level analysis #68

Open Phlya opened 6 years ago

Phlya commented 6 years ago

Discussion to address https://github.com/mirnylab/distiller-nf/issues/96. Since there is pairsamtools restrict, we can use that as a starting point for adding fragment-level stats. But after some thinking, it seems more complicated to implement properly than I thought.

So, how about we add another column (pair_frags_type?) to the pairs file with the classification of the pairs according to this. And then maybe the actual counting can be a part of the regular stats? Then in distiller restrict should be called before stats, if we include it there (and I think it should be there since people expect this kind of stats)?

Here is a list of the stats that need to be implemented, basing off of hiclib. What did I forget?

Phlya commented 6 years ago

As an alternative to all this, provide a way to easily plot pair frequency by distance by read direction?

nvictus commented 6 years ago

Here's an example using the dask-adapter for pairix in bioframe.

It certainly doesn't require either dask or pairix, but would need to be accumulated chunkwise. The functions to compute "contact areas" and pairs scalings were recently moved to cooltools, but they are quite short and reusable: https://github.com/mirnylab/cooltools/blob/master/cooltools/expected.py#L53

Phlya commented 6 years ago

I think we just need to tweak the output of stats a bit so that we don't don't need to do anything on the raw pairs - it already saves number of contacts at different distance with different orientations. I can make a plot like this from stats:

image

I suppose, it lacks normalisation over the areas for each bin... (This is all for mouse - with no centromeres, whether not counting over-centromeres pairs in stats directly is feasible I am not sure)

Also, if there are any ideas what went wrong with the "Bad" sample, I would be happy to hear your thoughts...

Phlya commented 6 years ago

OK, using @nvictus 's suggestion I can quite easily make this using the areas functions from cooltools and counts from stats.

image

Still not sure how to interpret the difference between samples, though (the bad one was not sequenced deeply, so the curves are not smooth).

Code, just in case it's useful.

def get_areas(d, chroms=pd.read_table('/home/s1529682/eddie/ilia/genomes/mm9/chrfile.txt', names=['chrom', 'length'])):
    bins = np.append(np.unique(d['Start'].values), [np.inf])
    areas = np.zeros(len(bins)-1)
    for i in range(len(chroms)):
        areas = areas + expected.contact_areas(bins, (0, chroms['length'][i]), (0, chroms['length'][i]))
    return areas

def get_scaling_by_orienation(stats):
    d = stats.loc[stats.index.str.startswith('dist_freq')]
    d = d.reset_index()
    d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:]
    d = d.drop('index', axis=1)
    d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x])
    d['Distance'] = np.exp((np.log(d['Start'])+np.log(d['End']))/2)
    d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']]
    return d
def plot_scaling_by_orienation(stats, ax):
    d = get_scaling_by_orienation(stats)
    areas = get_areas(d)
    for orientation in '+-', '-+', '++', '--':
        sd = d[d['Orientation']==orientation]
        sd['Value']/= areas
        ax.plot(sd['Distance'], sd['Value']/d['Value'].sum(), label=orientation)
gfudenberg commented 6 years ago

Are they the same cell type etc? If so, there may be many ligation in trans in the "bad" dataset, by looking at how the P(s) bends more flat around 10^6. If this is the case & you plot the average trans level normalized appropriately, you can see the cis reaching this "noise floor", as we show in MicroC-XL supFig3 ( https://media.nature.com/original/nature-assets/nmeth/journal/v13/n12/extref/nmeth.4025-S1.pdf )

On Sat, May 19, 2018, 4:29 AM Ilya Flyamer notifications@github.com wrote:

OK, using @nvictus https://github.com/nvictus 's suggestion I can quite easily make this using the areas functions from cooltools and counts from stats.

[image: image] https://user-images.githubusercontent.com/2895034/40268105-8082e3d8-5b5f-11e8-97ba-84f90190d944.png

Still not sure how to interpret the difference between samples, though (the bad one was not sequenced deeply, so I suppose the lack of smoothness is expected).

Code, just in case it's useful.

def get_areas(d, chroms=pd.read_table('/home/s1529682/eddie/ilia/genomes/mm9/chrfile.txt', names=['chrom', 'length'])): bins = np.append(np.unique(d['Start'].values), [np.inf]) areas = np.zeros(len(bins)-1) for i in range(len(chroms)): areas = areas + expected.contact_areas(bins, (0, chroms['length'][i]), (0, chroms['length'][i])) return areas

def get_scaling_by_orienation(stats): d = stats.loc[stats.index.str.startswith('dist_freq')] d = d.reset_index() d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:] d = d.drop('index', axis=1) d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x]) d['Distance'] = np.exp((np.log(d['Start'])+np.log(d['End']))/2) d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']] return d

def plot_scaling_by_orienation(stats, ax): d = get_scaling_by_orienation(stats) areas = get_areas(d) for orientation in '+-', '-+', '++', '--': sd = d[d['Orientation']==orientation] sd['Value']/= areas ax.plot(sd['Distance'], sd['Value'], label=orientation)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390398720, or mute the thread https://github.com/notifications/unsubscribe-auth/ASvzZbzd68_JwlOj82Ail5TqhMb4ICxRks5t0AIpgaJpZM4UFGjp .

Phlya commented 6 years ago

Thank @gfudenberg ! Yes, same cell type. This is just an example, I recently have problems like that with all libraries... Yeah, there is a lot of trans (60-70%), and yeah, scaling is flat on the right, clearly a lot of random ligations or something like that - what I mean is there are also differences on the left side, and I thought maybe they can hint at what exactly is going wrong? Because I just have no clue what is going wrong now...

gfudenberg commented 6 years ago

On Sun, May 20, 2018 at 10:17 AM, Ilya Flyamer notifications@github.com wrote:

Thank @gfudenberg https://github.com/gfudenberg ! Yes, same cell type. Yeah, there is a lot of trans (60-70%), and yeah, scaling is flat on the right, clearly a lot of random ligations or something like that - what I mean is there are also differences on the left side, and I thought maybe they can hint at what exactly is going wrong? Because I just have no clue what is going wrong now...

Hmm, I don't know if anyone has systematically looked into different failure modes & how they relate to short range orientation P(s). In part because I'd guess that a bunch of experiments that didn't work well never make it to GEO...

I've never seen a bump at 30kb before, so if that's real (and not just e.g. noise from not sequencing deeply enough) my first guess would be something went wrong with the restriction step, leaving a bunch of large fragments which then have trouble ligating. Anything funky with the size of 3C products on the gel?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390497243, or mute the thread https://github.com/notifications/unsubscribe-auth/ASvzZZaSH5A3s83S1bbsZnAvZ8F4nlNeks5t0aVBgaJpZM4UFGjp .

Phlya commented 6 years ago

I realize that, I just thought since you guys have been analyzing other people's data maybe you encountered something like that.

Do you mean 3 kb? Digestion works fine though... Size shift after Hi-C with blunt-end ligation is hard to see on the gel sometimes, so last week I just did 3C same way as I do Hi-C to check that the reagents are working fine, and the gel looks great. image (Perhaps not the best place for an extended discussion like this...)

gfudenberg commented 6 years ago

On Sun, May 20, 2018 at 10:53 AM, Ilya Flyamer notifications@github.com wrote:

I realize that, I just thought since you guys have been analyzing other people's data maybe you encountered something like that.

Do you mean 3 kb?

Ah, yes, meant 3kb

Digestion works fine though... Size shift after Hi-C with blunt-end ligation is hard to see on the gel sometimes, so last week I just did 3C same way as I do Hi-C to check that the reagents are working fine, and the gel looks great.

I see... yes, it would be interesting to systematically look at cis/total ratio vs. various short-distance orientation statistics (e.g. ratio of self-circles to dangling ends, etc.) to see if there's some useful troubleshooting signals.

nvictus commented 6 years ago

There's an enrichment in the supposed "self-circle" fraction < 10kb with a strange "double hump". I'm not sure how unusual it is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem though. Maybe it's downstream stuff like library prep or PCR.

Have you looked at @gibcus's bootcamp slides? There might be some QC ideas to try there, like digesting PCR products. He might be able to chime in too.

gibcus commented 6 years ago

Hey Nezar et al., We found self-circles a lot when we were still doing dilute Hi-C. If if the ligation is less inefficient, or there are some cross-linking issues, you may find that T4 ligase can only really ligate fractions to themselves. Do you also have high %trans interactions?

Johan

From: Nezar Abdennur notifications@github.com Sent: Sunday, May 20, 2018 2:33 PM To: mirnylab/pairsamtools pairsamtools@noreply.github.com Cc: Gibcus, Johan Johan.Gibcus@umassmed.edu; Mention mention@noreply.github.com Subject: Re: [mirnylab/pairsamtools] Fragment-level analysis (#68)

There's an enrichment in the supposed "self-circle" fraction < 10kb with a strange "double hump". I'm not sure how unusual it is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem though. Maybe it's downstream stuff like library prep or PCR.

Have you looked at @gibcushttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_gibcus&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=_Tuyre5TFOw_SJcU3xLeVYyYNnuNeN-SrrigTN9CCXw&e='s bootcamp slideshttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_hms-2Ddbmi_hic-2Ddata-2Danalysis-2Dbootcamp_blob_master_HiC-2DProtocol.pdf&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=YS-NJOr53PQLzx00HjMJaLC2CaFSy4rfxywjicrA2CE&e=? There might be some QC ideas to try there, like digesting PCR products. He might be able to chime in too.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_mirnylab_pairsamtools_issues_68-23issuecomment-2D390501699&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=dWYeFxrnViV32ivmhu_GTfXSQhLKA_FDHOZF7qmjFD0&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AIsEqHzgimAf-5FSVu4ZlM4jl9mbOrPis0ks5t0bbVgaJpZM4UFGjp&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=94X1wuvrLhJQCID-uYwqT3YarZ1lrXgWUNR5DnhBrWU&e=.

Phlya commented 6 years ago

Hi @nvictus , @gibcus

Thanks for your ideas! Yeah, it's possible fill-in is inefficient, but I am not sure how to test this...

Trans-fraction is very high. Crosslinking is not the problem because this has happened with cells crosslinked by different people and even in different labs.

Phlya commented 6 years ago

@gfudenberg you have a good eye! The small peak at 3kb is real and present in a few other new bad data, but never in the old good ones. Here is an extremely bad dataset where it's really obvious: image

And by looking at a few of them the height of the peak clearly correlates with how flat the scaling is on the right end! WTF is this?!

golobor commented 6 years ago

wow, this looks so weird!! Btw, what are the mapping stats, i.e. fraction of duplicates, unmapped and multimappers? Does it differ much from the "good" libraries?

On 21 May 2018 at 09:53, Ilya Flyamer notifications@github.com wrote:

@gfudenberg https://github.com/gfudenberg you have a good eye! The small peak at 3kb is real and present in a few other new bad data, but never in the old good ones. Here is an extremely bad dataset where it's really obvious: [image: image] https://user-images.githubusercontent.com/2895034/40310980-6c4f5c04-5d06-11e8-9724-8ef35e88ee4b.png

And by looking at a few of them the height of the peak clearly correlates with how flat the scaling is on the right end! WTF is this?!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390660044, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgeCgVltpuhg_l8iNHj43zaEdJD0ks5t0sbIgaJpZM4UFGjp .

Phlya commented 6 years ago

Yeah, there are differences - values as a fraction of total reads for this extremely bad one.

image image

golobor commented 6 years ago

huh! that's very concerning, actually! This suggests that the issue might be downstream of ligation, in sequencing. Did you check fastqc of the "bad" samples? Specifically, I'd check if you still have adapters

On 21 May 2018 at 10:32, Ilya Flyamer notifications@github.com wrote:

Yeah, there are differences - values as a fraction of total reads.

[image: image] https://user-images.githubusercontent.com/2895034/40312765-b8848716-5d0b-11e8-84f3-a5c40b26709c.png [image: image] https://user-images.githubusercontent.com/2895034/40312930-2365c96e-5d0c-11e8-9aea-8d093d18b6ca.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390671911, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uChhip7a_N__ipzbNMVXvopqYsjPHks5t0tARgaJpZM4UFGjp .

Phlya commented 6 years ago

No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify do_fastqc: True?

golobor commented 6 years ago

it's just a guess, thinking of what could bump multimappers and nulls... yes, do_fastqc does fastqc :)

On 21 May 2018 at 10:40, Ilya Flyamer notifications@github.com wrote:

No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify do_fastqc: True?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390674234, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgKYM02GOkPNYb0lQeYkqjc-zKUqks5t0tHdgaJpZM4UFGjp .

Phlya commented 6 years ago

No adaptors, but some weird stuff is going on!!! Unequal level of A and T across the whole read! And other things... If you want to have a look, here is a zip with the output, serum-RI3t is good, RI4t is bad (the first plot here on top is from these).

fastqc.zip

golobor commented 6 years ago

also, an intense GC bias...

On 21 May 2018 at 10:52, Ilya Flyamer notifications@github.com wrote:

No adaptors, but some weird stuff is going on!!! Unequal level of A and T across the whole read! And other things... If you want to have a look, here is a zip with the output, serum-RI3t is good, RI4t is bad (the first plot here on top is from these).

fastqc.zip https://github.com/mirnylab/pairsamtools/files/2023009/fastqc.zip

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390678179, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCkSQ2XzZWjLaWC6ytKbzM129aMrtks5t0tTJgaJpZM4UFGjp .

Phlya commented 6 years ago

(I have to say, they were sequenced on different machines)

Phlya commented 6 years ago

Also GATC signature in the beginning is way more pronounced.

Phlya commented 6 years ago

So no ideas from this?..

golobor commented 6 years ago

any chance you can bug the sequencing facility? This strongly seems like an issue of sequencing, not HiC

On 21 May 2018 at 13:13, Ilya Flyamer notifications@github.com wrote:

So no ideas from this?..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390720024, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCt4qPihzg6KAvd6uHoAk2CQQeHSoks5t0vWygaJpZM4UFGjp .

Phlya commented 6 years ago

If you mean there is something wrong on the technical side with the sequencing, some bad samples were sequenced in one place, and some - in another...

But I might ask them for their opinion on what could cause such issues though.

golobor commented 6 years ago

hm, i'm confused now : (a) are there "bad" samples that don't have fastqc issues? (b) are there "good" samples that were sequenced in the same facility as the "bad" ones?

On 21 May 2018 at 15:10, Ilya Flyamer notifications@github.com wrote:

If you mean there is something wrong on the technical side with the sequencing, some samples were sequenced in one place, and some - in another...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390752587, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgCnZAx4M1eCTLAuNPRQ1HtO1ef9ks5t0xESgaJpZM4UFGjp .

Phlya commented 6 years ago

a) I haven't run fastqc on all of them b) yes

I can try to compare them more comprehensively tomorrow

golobor commented 6 years ago

my point is that all observations you showed so far are consistent with a single issue behind "bad" samples - faulty sequencing; is there any piece of evidence that contradicts this theory?

On 21 May 2018 at 15:14, Anton Goloborodko goloborodko.anton@gmail.com wrote:

hm, i'm confused now : (a) are there "bad" samples that don't have fastqc issues? (b) are there "good" samples that were sequenced in the same facility as the "bad" ones?

On 21 May 2018 at 15:10, Ilya Flyamer notifications@github.com wrote:

If you mean there is something wrong on the technical side with the sequencing, some samples were sequenced in one place, and some - in another...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390752587, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgCnZAx4M1eCTLAuNPRQ1HtO1ef9ks5t0xESgaJpZM4UFGjp .

Phlya commented 6 years ago

Both good and bad data have been produced in two different facilities, and at one point a few months ago all libraries, wherever we sequence them, started having similar issues.

Phlya commented 6 years ago

So, I ran fastqc for many of the good and bad libraries, and the only mostly consistent problem with the bad ones is very high over-representation of a particular group of K-mers in a particular place in the forward reads. This sequence looks like a part of the adaptor sequence, why it is so often found in a particular place, and only in the forward read is completely enigmatic to me, but their actual counts are pretty low and overall adaptor sequences are not very frequent, so I don't think this can be the source of these problems. image

golobor commented 6 years ago

and what about the A-vs-T bias, is it common for all "bad" datasets?

On 22 May 2018 at 08:35, Ilya Flyamer notifications@github.com wrote:

So, I ran fastqs for many of the good and bad libraries, and the only mostly consistent problem with the bad ones is very high over-representation of a particular group of K-mers in a particular place in the forward reads. This sequence looks like a part of the adaptor sequence, why it is so often found in a particular place, and only in the forward read is completely enigmatic to me, but their actual counts are pretty low and overall adaptor sequences are not very frequent, so I don't think this can be the source of these problems. [image: image] https://user-images.githubusercontent.com/2895034/40361958-fe7962ca-5dc2-11e8-8e1c-2820c0213a36.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390973589, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCt1YuHI8rHQOD276hcPi-w7EUXfHks5t1AX1gaJpZM4UFGjp .

Phlya commented 6 years ago

Well, it seems it's quite commonly present in reverse reads, but not in forward reads, whatever that means, and however that is possible... Here is an example. image image

Phlya commented 6 years ago

And yeah, similar effect in G vs C in reverse reads is also present.

golobor commented 6 years ago

I believe all of this is sufficiently concerning to contact the sequencing facility - you paid a lot of money for it, after all. Is the amount of the T->A and C->G bias correlated with how "bad" the library is?

To conclude, do you think that the issues of the "bad" libraries can be entirely related to sequencing (and not Hi-C) or there are some observations against that theory?

On 22 May 2018 at 14:48, Ilya Flyamer notifications@github.com wrote:

Well, it seems it's quite commonly present in reverse reads, but not in forward reads, whatever that means, and however that is possible... Here is an example. [image: image] https://user-images.githubusercontent.com/2895034/40383537-094abc18-5df9-11e8-9e94-94eb53eade8b.png [image: image] https://user-images.githubusercontent.com/2895034/40383552-11b11e2e-5df9-11e8-9ce7-4c68943f927a.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-391100421, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCiM-1aRjB7-hx_u4nqhgGZ_NPVEvks5t1F2BgaJpZM4UFGjp .

Phlya commented 6 years ago

I will check that.

Well, as I said, I think it's unlikely that two facilities at the same time stop producing good data with similar symptoms...

Phlya commented 6 years ago

It seems like quality (scaling shallowness) is correlated to this bias, yes. But hard to say definitively.

golobor commented 6 years ago

Ah, great point. Btw, your multimappers went through the roof - more than a half of the sample. Maybe you can check them manually to see what's wrong with them?

On Tue, May 22, 2018, 15:33 Ilya Flyamer notifications@github.com wrote:

It seems like quality (scaling shallowness) is correlated to this bias, yes. But hard to say definitively.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-391113862, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCuD5nLm5WV9JRqTs-6QC18S-4oyAks5t1GgkgaJpZM4UFGjp .

Phlya commented 6 years ago

And actually I was wrong when I said that it's not present in forward reads - reverse reads are certainly more affected, but in at least one run most samples have it in forward reads too to a different extent - which seems to correlate with just how bad they are.

Phlya commented 6 years ago

Yeah, I noticed that too, and it's present (again, to a different extent) in a few samples I checked, but not sure how to look at them - like, what exactly should I look into?

Phlya commented 6 years ago

Also, how about we switch to e.g. G-chat @golobor ? I doubt everyone is interested in this!

wbszhu commented 3 years ago

Hi, Is there any way that we can get another column (pair_frags_type?)? I really expect this kind of stats. Thanks a lot. Luzhang