Closed angelosarmen closed 4 years ago
Dear Angelos,
1) We provide blacklisted regions that are removed automatically for the mm10 genome. However, if you prefer, you can provide a BED file containing your own regions to be removed. A good repository of blacklisted regions is http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/ (among other repositories). The -bl option lets you pass in your BED file. We use different blacklisted regions for each genome.
2) -reg True runs ChromA on selected regions. This run should take less than a minute and it is performed for validation purposes. We recommend running this setting when you first install ChromA to validate the installation. In addition, you can enable this setting when you have a new file to validate the input path and the correctness of the file format.
Lastly, here is an example of the regions we use when you enable the -reg True option for the mm10 genome. These regions contain housekeeping genes (such as Actb).
regions_list = [['chr16', 32580000, 32670000], ['chr8', 105610000, 105705000], ['chr9', 106180000, 106250000], ['chr7', 45098000, 45160000], ['chr5', 142840000, 142952000], ['chr11', 100849000, 100945000], ['chr12', 85666000, 85761000], ['chr5', 32095000, 32190000], ['chr13', 30732000, 30825000], ['chr3', 94303000, 94399000]]
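For anyone who wants to look at these validation regions in a genome browser, a list in this form can be written out as BED lines. This is only a minimal sketch and not part of ChromA: the helper name regions_to_bed is hypothetical, only the first three regions are reproduced, and it assumes the coordinates can be used as BED start/end values unchanged.

```python
def regions_to_bed(regions):
    """Format [chrom, start, end] triples as tab-separated BED lines.

    Hypothetical helper for illustration; not a ChromA function.
    """
    lines = []
    for chrom, start, end in regions:
        if start >= end:
            raise ValueError(f"empty or inverted region: {chrom}:{start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}")
    return "\n".join(lines) + "\n"


# First three regions from the mm10 list quoted in the thread.
regions_list = [
    ["chr16", 32580000, 32670000],
    ["chr8", 105610000, 105705000],
    ["chr9", 106180000, 106250000],
]

bed_text = regions_to_bed(regions_list)
print(bed_text, end="")
```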
3) Could you clarify what "N average replicates" means? I do not understand how the word "average" is used in this context. I have to double-check, but off the top of my head, I believe that execution time scales as N + 1. This is because we keep in memory and compute a state-space model for each replicate plus the consensus.
Thanks,
Mariano
Dear Mariano,
Thank you for your prompt reply.
The option -bl permits to input your bed file. We use different blacklisted regions for each genome.
Could you please double-check this? In line 55 (and others) of ChromA, type=bool is used, but this converts any non-empty string to True (see also https://bugs.python.org/issue24754). The resulting blacklisted boolean variable is then used to determine whether to use the mm10 blacklist in data_handle.py. Therefore, it seems that I have to apply the blacklist of my choice externally and use -bl= in order for ChromA to not use the mm10 blacklist.
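The type=bool behaviour described here can be reproduced with the standard library alone. The sketch below is not ChromA's actual parser: the parsers, the str2bool helper and the file name my_regions.bed are made up for illustration. It shows that argparse calls bool() on the raw string, so both a path and the text "False" become True, and one possible way around it.

```python
import argparse


def str2bool(value):
    """Parse common spellings of true/false; reject anything else."""
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


# Pitfall: type=bool applies bool() to the raw string,
# and every non-empty string is truthy.
naive = argparse.ArgumentParser()
naive.add_argument("-bl", type=bool, default=False)
naive_path = naive.parse_args(["-bl", "my_regions.bed"])
naive_false = naive.parse_args(["-bl", "False"])
print(naive_path.bl)   # True (the path itself is lost)
print(naive_false.bl)  # True (not False)

# One possible fix: keep paths as strings and parse real
# boolean flags explicitly.
fixed_parser = argparse.ArgumentParser()
fixed_parser.add_argument("-bl", type=str, default=None)
fixed_parser.add_argument("-reg", type=str2bool, default=False)
fixed = fixed_parser.parse_args(["-bl", "my_regions.bed", "-reg", "False"])
print(fixed.bl)   # my_regions.bed
print(fixed.reg)  # False
```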
Would it be possible to clarify what "N average replicates" mean? I do not understand how the word average is used in this context.
Fair enough, I was referring to averaged-sized BAM files.
Dear Angelos, I have updated the blacklisting routine to include new genomes. Now, you can blacklist regions from the hg38, hg19, and mm10 genomes automatically. For any other genome, pass the path of the blacklist BED file with the -bl option as:
-bl "/path/regions_to_blacklist_tab_separated.bed"
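As a rough illustration of what a tab-separated blacklist file contains and how it could be applied externally, here is a minimal sketch. It is not ChromA's implementation: the function names are hypothetical, the blacklist coordinates are made up, and intervals are assumed to be half-open as in the BED convention.

```python
import io


def read_bed(handle):
    """Return a dict mapping chromosome -> list of (start, end) tuples."""
    regions = {}
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        regions.setdefault(chrom, []).append((start, end))
    return regions


def is_blacklisted(chrom, start, end, blacklist):
    """True if the half-open interval [start, end) overlaps a blacklisted region."""
    return any(
        start < b_end and b_start < end
        for b_start, b_end in blacklist.get(chrom, [])
    )


# Made-up blacklist content standing in for a real BED file on disk.
example_bed = io.StringIO("chr1\t1000\t2000\nchr2\t500\t800\n")
blacklist = read_bed(example_bed)

intervals = [("chr1", 1500, 1600), ("chr1", 2000, 2100), ("chr3", 0, 100)]
kept = [iv for iv in intervals if not is_blacklisted(*iv, blacklist)]
print(kept)  # [('chr1', 2000, 2100), ('chr3', 0, 100)]
```

With a real file, open(path) can be passed to read_bed in place of the io.StringIO object.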
Thanks,
Mariano
Please, Angelos, give it a try (v2.1.1) and should you need any help, send me your email and we can correspond.
Thank you very much for this, I'll give it a try.
Hi,
Thank you for developing ChromA. I have a few questions:
1) I understand that ChromA doesn't remove chrY and chrM reads (so those must be removed beforehand) but it does remove blacklisted regions (-bl argument). On inspection of the code, however, it seems that the mm10 blacklist is hardcoded (data_handle.py, line 377).
2) What does -reg True do? What regions are selected?
3) Is it possible to give a rough estimate of how long it takes to run Consensus ChromA with N average replicates? Does execution time scale linearly with N?