ryanlayer / giggle

Interval data structure
MIT License

giggle: Could not open human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz. #63

Closed: MenglinC closed this issue 2 months ago

MenglinC commented 2 years ago

To whom it may concern,

When I tried to build an index from the Cistrome histone modification peak files, giggle reported an error saying "could not open XXX".

giggle index -i "human_hm_sort/*.gz"  -o human_hm_index -f -s

the error information:

Could not open file 'human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz' giggle: Could not open human_hm_sort/3651_sort_peaks.narrowPeak.bed.gz.

I have already sorted these BED files using the sort_bed script. I think this error message is too vague. How can I fix this problem? Thanks for your kind advice.

Xiu

MenglinC commented 2 years ago

Hi, I am sorry to trouble you again. This error has kept confusing me over the past few days, although I have tried many measures to fix it. I believe GIGGLE's developers still have a responsibility to maintain it after its release. I hope I can get some help from you. Here is my code:

 /home/xxzhang/workplace/software/giggle/scripts/sort_bed "./named/[A-J]*" ./named_sort/ 30
time giggle index -i "./named_sort/*gz" -o ./named_sort_b -s -f

and this is the error:

Could not open file './named_sort/H3K27ac_H1_Embryonic_Stem_CellEmbryo.18.bed.gz' giggle: Could not open ./named_sort/H3K27ac_H1_Embryonic_Stem_CellEmbryo.18.bed.gz.

I am using the Cistrome histone mark data in my project and expect to get enrichment results from GIGGLE. However, I keep hitting the same error, "could not open file ...", with no other information. I have read the previous GIGGLE issues and tried several things, but all of them failed:

(1) I checked that the Cistrome files to be indexed are all tab-delimited.
(2) I set ulimit -c 100000.
(3) I put all the files in one folder and ran the commands above.
(4) I tried indexing other files and got the same error.

I really do not know what to do next. Your advice is really important to me. This problem has puzzled me for nearly a month. If it cannot be fixed, I may have to use other tools to solve my problem.

Thanks!

Xiu
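One check that the list above does not cover: `ulimit -c` sets the maximum core-dump size, while the limit on how many files a process may hold open is `ulimit -n`. The sketch below compares the number of input files with that limit. It rests on the assumption, not confirmed anywhere in this thread, that `giggle index` keeps many input files open at once; the function names `count_inputs` and `check_fd_limit` are hypothetical helpers, not part of giggle.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a pre-flight check; not part of giggle itself.

# count_inputs DIR PATTERN
# Print how many regular files directly inside DIR match PATTERN.
count_inputs() {
    local dir=$1 pattern=$2
    find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' '
}

# check_fd_limit N_FILES
# Compare N_FILES with the soft open-file limit of the current shell.
# Returns 0 if the files fit under the limit, 1 otherwise.
check_fd_limit() {
    local n_files=$1 limit
    limit=$(ulimit -n)
    if [ "$limit" = "unlimited" ] || [ "$n_files" -lt "$limit" ]; then
        echo "ok: $n_files files, open-file limit is $limit"
    else
        echo "warning: $n_files files, but open-file limit is only $limit"
        return 1
    fi
}
```

A possible use is `check_fd_limit "$(count_inputs ./named_sort '*.gz')"`. If it prints a warning, either raising the limit with `ulimit -n` (where the system allows it) or indexing fewer files at a time may be worth trying.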

MenglinC commented 2 years ago

Hi, everyone! I solved this problem successfully. The likely reason is that GIGGLE needs more memory or other computational resources than the default per-user limits allow. My workaround was to split the files into several parts and build an index for each part separately, then combine the search results from all parts into the final result. The detailed commands are as follows:

(base) [xxzhang@cu08 human_histone_mark]$ mkdir H3K27me3
(base) [xxzhang@cu08 human_histone_mark]$ cp ./named_sort/H3K27me3* ./H3K27me3/
(base) [xxzhang@cu08 human_histone_mark]$ cd ./H3K27me3/
(base) [xxzhang@cu08 H3K27me3]$ mkdir named_H3K27me3_s1
(base) [xxzhang@cu08 H3K27me3]$ mkdir named_H3K27me3_s2
(base) [xxzhang@cu08 H3K27me3]$ mkdir named_H3K27me3_s3
(base) [xxzhang@cu08 H3K27me3]$ ls -Q ./  |head -500 |xargs -i mv ./{} ./named_H3K27me3_s1/
ls: write error: Broken pipe
(base) [xxzhang@cu08 H3K27me3]$ ls -Q ./  |head -500 |xargs -i mv ./{} ./named_H3K27me3_s2/
ls: write error: Broken pipe
(base) [xxzhang@cu08 H3K27me3]$ mv ./*.gz ./named_H3K27me3_s3/
(base) [xxzhang@cu08 H3K27me3]$ giggle index -i "./named_H3K27me3_s1/*" -o ./named_H3K27me3_s1_index -s -f
Indexed 5884451 intervals.
(base) [xxzhang@cu08 H3K27me3]$ giggle index -i "./named_H3K27me3_s2/*" -o ./named_H3K27me3_s2_index -s -f
Indexed 4270175 intervals.
(base) [xxzhang@cu08 H3K27me3]$ giggle index -i "./named_H3K27me3_s3/*" -o ./named_H3K27me3_s3_index -s -f
Indexed 4924467 intervals.
(base) [xxzhang@cu08 H3K27me3]$ cp ../Hs_repeat.bed.gz ./
(base) [xxzhang@cu08 H3K27me3]$ giggle search -i ./named_H3K27me3_s1_index/ -q Hs_repeat.bed.gz -s >Hs_repeat.bed.gz.giggle.H3K27me3_s1.result
(base) [xxzhang@cu08 H3K27me3]$ giggle search -i ./named_H3K27me3_s2_index/ -q Hs_repeat.bed.gz -s >Hs_repeat.bed.gz.giggle.H3K27me3_s2.result
(base) [xxzhang@cu08 H3K27me3]$ giggle search -i ./named_H3K27me3_s3_index/ -q Hs_repeat.bed.gz -s >Hs_repeat.bed.gz.giggle.H3K27me3_s3.result
(base) [xxzhang@cu08 H3K27me3]$ cat Hs_repeat.bed.gz.giggle.H3K27me3_s* >Hs_repeat.bed.gz.giggle.H3K27me3_all.result
(base) [xxzhang@cu08 H3K27me3]$ awk '$8>0' Hs_repeat.bed.gz.giggle.H3K27me3_all.result >repeat_positive.H3K27me3.result

This solution is rather complex, but it does fix the problem. I hope it gives you some clues for your own problems.
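The manual steps above can be sketched as a small script. This is only an illustration of the same idea: the function names `shard_files` and `index_shards` are hypothetical, the shard size is a free parameter, and the `giggle index` flags are copied unchanged from the commands in this thread. Looping over a glob instead of piping `ls` through `head` and `xargs` avoids the "Broken pipe" message and never picks up the shard directories themselves.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; a sketch of the manual sharding shown in this thread.

# shard_files SRC_DIR OUT_PREFIX SHARD_SIZE
# Move the *.gz files in SRC_DIR into OUT_PREFIX1, OUT_PREFIX2, ... with at
# most SHARD_SIZE files per directory. Prints the number of shards created.
shard_files() {
    local src=$1 prefix=$2 size=$3
    local i=0 shard=0 f
    for f in "$src"/*.gz; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        if [ $((i % size)) -eq 0 ]; then
            shard=$((shard + 1))     # start a new shard every SHARD_SIZE files
            mkdir -p "${prefix}${shard}"
        fi
        mv "$f" "${prefix}${shard}/"
        i=$((i + 1))
    done
    echo "$shard"
}

# index_shards OUT_PREFIX N_SHARDS
# Build one giggle index per shard directory, with the flags used above.
index_shards() {
    local prefix=$1 n=$2 s
    for ((s = 1; s <= n; s++)); do
        giggle index -i "${prefix}${s}/*" -o "${prefix}${s}_index" -s -f
    done
}
```

For example, `n=$(shard_files ./H3K27me3 ./H3K27me3/named_H3K27me3_s 500)` followed by `index_shards ./H3K27me3/named_H3K27me3_s "$n"` would reproduce the layout above with 500 files per shard. Each shard index can then be searched separately and the result files concatenated, as in the original commands.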

mchowdh200 commented 2 months ago

Hi Xiu, sorry we missed your issue. We ran into similar issues in the past for large indices and employed a similar sharding strategy. If you look under the sharding directory in this repository, you will find a script that can be used to build and search a sharded giggle index (along with instructions for running).