Use multiple genome builds

bschilder commented 2 years ago

Currently I think the same blacklist arg is used for filtering both the reference and peakfiles.

But that means if your reference is hg19, and your peakfiles is hg38 (or vice versa) at least one of them will be filtered incorrectly.

There may be other parts in a code where this is an issue, e.g. anywhere that genome_build if applied to both reference and peakfiles.

To resolve this, you could take a named list a genome:

genome <- list("reference"="hg38", "peakfiles"="hg19")

If only a single character string is supplied, you could assume the users want to use for both reference and peakfiles, but should produce a message indicating this so it's clear.

genome <- "hg19"

NathanSkene commented 2 years ago

Well spotted!

From: Brian M. Schilder @.> Sent: 13 May 2022 16:15 To: neurogenomics/EpiCompare @.> Cc: Subscribed @.***> Subject: [neurogenomics/EpiCompare] Use multiple genome builds (Issue #61)

This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.

Currently I think the same blacklist arg is used for filtering both the reference and peakfiles.

But that means if your reference is hg19, and your peakfiles is hg38 (or vice versa) at least one of them will be filtered incorrectly.

There may be other parts in a code where this is an issue, e.g. anywhere that genome_build if applied to both reference and peakfiles.

To resolve this, you could take a named list a genome:

genome <- list("reference"="hg38", "peakfiles"="hg19")

If only a single character string is supplied, you could assume the users want to use for both reference and peakfiles, but should produce a message indicating this so it's clear.

genome <- "hg19"

— Reply to this email directly, view it on GitHubhttps://github.com/neurogenomics/EpiCompare/issues/61, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPEZ5Q3EETMFGSQEZGULVJZWXLANCNFSM5V3WBYAQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>

serachoi1230 commented 2 years ago

I thought the genome build used for reference and peakfiles had to be the same to compare them (i.e. calculate percentage of overlapping peaks etc)? So for example, if one is hg18 and other hg38, the percentage overlap wouldn't be accurate? What do you think?

bschilder commented 2 years ago

That's exactly it though, reference and peakfiles are not always on the same genome build when users provide them as input. SoEpiCompare should ideally handle any liftover to make them comparable. Otherwise, the user has to liftover one to the other's genome build themselves.

bschilder commented 2 years ago

Both reference and peakfiles get passed through tidy_peakfile() (which takes blacklist as an arg) at the beginning of the rmarkdown. So we need to filter each file with the correct version of blacklist (hg19 or hg38) matched to build of the input reference or peaklist file being supplied as the other arg.

https://github.com/neurogenomics/EpiCompare/blob/3f59ce8e736d75ab3c0768fa617237dfffb8e696/inst/markdown/EpiCompare.Rmd#L56

https://github.com/neurogenomics/EpiCompare/blob/3f59ce8e736d75ab3c0768fa617237dfffb8e696/inst/markdown/EpiCompare.Rmd#L61

Working on this now.

serachoi1230 commented 2 years ago

Ahhh right, I just assumed users would make sure that the reference and peakfiles both have the same genome build - but I see what you mean, this is definitely a good idea. Has this been done, or should I have a go now?

bschilder commented 2 years ago

Yeah, for example, any of our data generated by the nf-core/cutandrun pipeline has been aligned to hg38, but all our reference data from ENCODE is in hg19.

I got pretty far with this yesterday, so I'll try finishing this up today. If there's anything left to do I'll let you know.

serachoi1230 commented 2 years ago

I've actually been using reference data from ENCODE that's in hg38 -in case you haven't seen these

bschilder commented 2 years ago

Ohhh, good to know! Please to add this to the documentation for any ENCODE datasets included in EpiCompare. Screenshot 2022-05-18 at 11 19 42

bschilder commented 2 years ago

Now implemented:

Overhaul how EpiCompare handles genome builds:
- New argument genome_build_output allows users to specify which genome build to standardise all inputs to.
- genome_build can now take a named list to specify different genome builds for peakfiles, reference, and blacklist.
- Added functions to parse and validate all genome build-related arguments.

See the NEWS file for all updates/bug fixes.

bschilder commented 2 years ago

Updated the documentation as well: https://neurogenomics.github.io/EpiCompare/reference/EpiCompare.html

Screenshot 2022-05-18 at 14 55 21

neurogenomics / EpiCompare

Use multiple genome builds #61