Closed bschilder closed 2 years ago
Well spotted!
From: Brian M. Schilder @.> Sent: 13 May 2022 16:15 To: neurogenomics/EpiCompare @.> Cc: Subscribed @.***> Subject: [neurogenomics/EpiCompare] Use multiple genome builds (Issue #61)
This email from @.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders listhttps://spam.ic.ac.uk/SpamConsole/Senders.aspx to disable email stamping for this address.
Currently I think the same blacklist arg is used for filtering both the reference and peakfiles.
But that means if your reference is hg19, and your peakfiles is hg38 (or vice versa) at least one of them will be filtered incorrectly.
There may be other parts in a code where this is an issue, e.g. anywhere that genome_build if applied to both reference and peakfiles.
To resolve this, you could take a named list a genome:
genome <- list("reference"="hg38", "peakfiles"="hg19")
If only a single character string is supplied, you could assume the users want to use for both reference and peakfiles, but should produce a message indicating this so it's clear.
genome <- "hg19"
— Reply to this email directly, view it on GitHubhttps://github.com/neurogenomics/EpiCompare/issues/61, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AH5ZPEZ5Q3EETMFGSQEZGULVJZWXLANCNFSM5V3WBYAQ. You are receiving this because you are subscribed to this thread.Message ID: @.***>
I thought the genome build used for reference
and peakfiles
had to be the same to compare them (i.e. calculate percentage of overlapping peaks etc)? So for example, if one is hg18 and other hg38, the percentage overlap wouldn't be accurate? What do you think?
That's exactly it though, reference
and peakfiles
are not always on the same genome build when users provide them as input. SoEpiCompare
should ideally handle any liftover to make them comparable. Otherwise, the user has to liftover one to the other's genome build themselves.
Both reference and peakfiles get passed through tidy_peakfile()
(which takes blacklist as an arg) at the beginning of the rmarkdown. So we need to filter each file with the correct version of blacklist (hg19 or hg38) matched to build of the input reference
or peaklist
file being supplied as the other arg.
Working on this now.
Ahhh right, I just assumed users would make sure that the reference
and peakfiles
both have the same genome build - but I see what you mean, this is definitely a good idea. Has this been done, or should I have a go now?
Yeah, for example, any of our data generated by the nf-core/cutandrun
pipeline has been aligned to hg38, but all our reference data from ENCODE is in hg19.
I got pretty far with this yesterday, so I'll try finishing this up today. If there's anything left to do I'll let you know.
I've actually been using reference data from ENCODE that's in hg38 -in case you haven't seen these
Ohhh, good to know! Please to add this to the documentation for any ENCODE datasets included in EpiCompare.
Now implemented:
genome_build_output
allows users to specify
which genome build to standardise all inputs to. genome_build
can now take a named list to specify different genome builds
for peakfiles
, reference
, and blacklist
. See the NEWS file for all updates/bug fixes.
Updated the documentation as well: https://neurogenomics.github.io/EpiCompare/reference/EpiCompare.html
Currently I think the same
blacklist
arg is used for filtering both thereference
andpeakfiles
.But that means if your
reference
is hg19, and yourpeakfiles
is hg38 (or vice versa) at least one of them will be filtered incorrectly.There may be other parts in a code where this is an issue, e.g. anywhere that
genome_build
if applied to bothreference
andpeakfiles
.To resolve this, you could take a named list a
genome
:If only a single character string is supplied, you could assume the users want to use for both
reference
andpeakfiles
, but should produce a message indicating this so it's clear.