skrakau / PureCLIP

Capturing protein-RNA interaction footprints from single-nucleotide CLIP-seq data
GNU General Public License v3.0
29 stars 8 forks source link

segmentation error #18

Closed nick-webster closed 4 years ago

nick-webster commented 4 years ago

Hi, I get a segmentation error when running pure clip. Is this just a memory issue? I have 32 GB RAM.

here is the log:

(base) nickwebster@Nicks-iMac-Pro Eclipse_results % pureclip -i Q0013_IP1_1_S15_L006_R1_001.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam -bai Q0013_IP1_1_S15_L006_R1_001.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam.bai -g hg38.fa -iv 'chr1;chr2;chr3;' -nt 8 -o Pureclip.clink.1.bed Protein-RNA crosslink site detection

Created look-up table for values from -2000 to 0 with step size 0.00333333 (size: 600000). Loading reference ... Load user specified genomic intervals for learning ... DBL_MIN_10_EXP: -307 LDBL_MIN_10_EXP: -4931 computed prior_kdeThreshold: 0.0558519 Use bandwidth: 50 Use KDE threshold: 0.00807885 Use bandwidth to estimate n: 50 Learn HMM parameters: Compute SLR ... Compute KDEs ... Estiamte Ns ... 1 Prior ML estimation of density distribution parameters using predefined cutoff ... Initial gamma1: Run simplex algorithm using different start points. updateThetaAndK... kMax: 1 Note: fixed shape parameter k to: 1 Note: fixed shape parameter k to: 1 Note: fixed shape parameter k to: 1 Note: fixed shape parameter k to: 1 Updated parameters of gamma1 distribution using GSL simplex2: GAMMA b0:-4.09643 mean:0.016632 k:1 tp:0.00807885

Initial gamma2: Run simplex algorithm using different start points. updateThetaAndK... kMax: 10 Note: fixed shape parameter k to: 1 Note: fixed shape parameter k to: 1 Note: fixed shape parameter k to: 1 Updated parameters of gamma2 distribution using GSL simplex2: GAMMA b0:-1.29755 mean:0.273201 k:1 tp:0.00807885

initial transition frequencies: 0: 39117476 8766 12529 28
1: 8762 20 24 1
2: 12538 20 2321528 27648
3: 29 1 27647 1973

Build HMM ... Transition probabilitites 0: 0.999455 0.000223972 0.000320117 7.15403e-07
1: 0.99489 0.00227092 0.00272511 0.000113546
2: 0.00530881 8.46835e-06 0.982976 0.0117067
3: 0.000978078 3.37268e-05 0.932445 0.066543

Baum-Welch ... learn binomial parameter .. 0th iteration

. . . . . .

.. 6th iteration computeEmissionProbs() computeStatePosteriorsFB() updateDensityParams() GAMMA b0:-3.99349 mean:0.0184353 k:1 tp:0.00807885

GAMMA b0:-1.21561 mean:0.296528 k:1.00332 tp:0.00807885

Transition probabilitites 0: 0.999749 3.2158e-12 0.000251022 4.14413e-52
1: 1 3.42608e-60 2.72216e-39 5.20135e-48
2: 0.00478297 3.36741e-50 0.994429 0.000787655
3: 7.24451e-50 3.49711e-70 0.744961 0.255039

ZTBIN p:0.0106422

ZTBIN p:0.157625

.. 7th iteration computeEmissionProbs() computeStatePosteriorsFB() updateDensityParams() Convergence ! learn gamma parameter .. 0th iteration computeEmissionProbs() computeStatePosteriorsFB() updateDensityParams() Note: fixed shape parameter k to: 1 Convergence ! Apply learned parameters to whole dataset ... chr11 chr10 chr1 Get preprocessed intervals: Get preprocessed intervals: Get preprocessed intervals:

Compute KDEs ... Compute KDEs ... Compute KDEs ... Estiamte Ns ... 1 Estiamte Ns ... 1 Estiamte Ns ... 1 Apply learned HMM: applyParameters Apply learned HMM: applyParameters Apply learned HMM: applyParameters chr11_KI270721v1_random Get preprocessed intervals: chr12 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 chr13 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr14 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters Apply learned HMM: chr14_GL000009v2_random Get preprocessed intervals: applyParameters chr14_GL000225v1_random Get preprocessed intervals: chr14_KI270722v1_random Get preprocessed intervals: chr14_GL000194v1_random Get preprocessed intervals: chr14_KI270723v1_random Get preprocessed intervals: chr14_KI270724v1_random Get preprocessed intervals: chr14_KI270725v1_random Get preprocessed intervals: chr14_KI270726v1_random Get preprocessed intervals: chr15 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 chr15_KI270727v1_random Get preprocessed intervals: chr16 Get preprocessed intervals: zsh: segmentation fault pureclip -i -bai -g hg38.fa -iv 'chr1;chr2;chr3;' -nt 8 -o (base) nickwebster@Nicks-iMac-Pro Eclipse_results %

skrakau commented 4 years ago

Hi, good question, likely it is a memory issue, although I am wondering why it occurs at that step with those parameters. You could check this by reducing the -nta parameter for example to 1, which should indirectly also reduce the required memory or run it on a system with more memory.

Let me know if the problem persists.

nick-webster commented 4 years ago

Thanks, I’ll give it a shot.

Nick

Nick Webster Sent from my iPhone.

On Jul 1, 2020, at 1:30 AM, Sabrina Krakau notifications@github.com wrote:



Hi, good question, likely it is a memory issue, although I am wondering why it occurs at that step with those parameters. You could check this by reducing the -nta parameter for example to 1, which should indirectly also reduce the required memory or run it on a system with more memory.

Let me know if the problem persists.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/skrakau/PureCLIP/issues/18#issuecomment-652275422, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD3AGXBMAHK7ZALBBW2LSQ3RZLX3HANCNFSM4OJSECTA.

nick-webster commented 4 years ago

Ran it again with -nta 1 but still failed:

Convergence ! Apply learned parameters to whole dataset ... chr1 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr10 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr11 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr11_KI270721v1_random Get preprocessed intervals: chr12 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr13 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr14 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr14_GL000009v2_random Get preprocessed intervals: chr14_GL000225v1_random Get preprocessed intervals: chr14_KI270722v1_random Get preprocessed intervals: chr14_GL000194v1_random Get preprocessed intervals: chr14_KI270723v1_random Get preprocessed intervals: chr14_KI270724v1_random Get preprocessed intervals: chr14_KI270725v1_random Get preprocessed intervals: chr14_KI270726v1_random Get preprocessed intervals: chr15 Get preprocessed intervals: Compute KDEs ... Estiamte Ns ... 1 Apply learned HMM: applyParameters chr15_KI270727v1_random Get preprocessed intervals: chr16 Get preprocessed intervals: zsh: segmentation fault pureclip -i -bai -g hg38.fa -iv 'chr1;chr2;chr3;' -nt 8 -nta 1 -o (base) nickwebster@Nicks-iMac-Pro Eclipse_results %

Seems to stop at the same point every time. Could it be something to do with chr16 in the genome file? The reads were aligned against hg38 but it might have been an earlier release. Perhaps I should tell it to learn on chr16.

Nick


Nicholas Webster, Ph.D., M.A. Professor of Medicine Chief, Division of Endocrinology and Metabolism Associate Director for Shared Resources, Moores Cancer Center University of California, San Diego 9500 Gilman Drive La Jolla CA 92093 (858) 534-6275

On Jul 1, 2020, at 1:30 AM, Sabrina Krakau notifications@github.com<mailto:notifications@github.com> wrote:

Hi, good question, likely it is a memory issue, although I am wondering why it occurs at that step with those parameters. You could check this by reducing the -nta parameter for example to 1, which should indirectly also reduce the required memory or run it on a system with more memory.

Let me know if the problem persists.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/skrakau/PureCLIP/issues/18#issuecomment-652275422, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD3AGXBMAHK7ZALBBW2LSQ3RZLX3HANCNFSM4OJSECTA.

nick-webster commented 4 years ago

So I set chr 15, 16 and 17 as learning set. And got the following error:

Created look-up table for values from -2000 to 0 with step size 0.00333333 (size: 600000). Loading reference ... Load user specified genomic intervals for learning ... DBL_MIN_10_EXP: -307 LDBL_MIN_10_EXP: -4931 computed prior_kdeThreshold: 0.0558519 Use bandwidth: 50 Use KDE threshold: 0.00807885 Use bandwidth to estimate n: 50 Learn HMM parameters: zsh: segmentation fault pureclip -i -bai -g hg38.fa -iv 'chr15;chr16;chr17;' -nt 8 -nta 1 -o

So maybe something to do with chr16?

Nick


Nicholas Webster, Ph.D., M.A. Professor of Medicine Chief, Division of Endocrinology and Metabolism Associate Director for Shared Resources, Moores Cancer Center University of California, San Diego 9500 Gilman Drive La Jolla CA 92093 (858) 534-6275

On Jul 1, 2020, at 1:30 AM, Sabrina Krakau notifications@github.com<mailto:notifications@github.com> wrote:

Hi, good question, likely it is a memory issue, although I am wondering why it occurs at that step with those parameters. You could check this by reducing the -nta parameter for example to 1, which should indirectly also reduce the required memory or run it on a system with more memory.

Let me know if the problem persists.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/skrakau/PureCLIP/issues/18#issuecomment-652275422, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD3AGXBMAHK7ZALBBW2LSQ3RZLX3HANCNFSM4OJSECTA.

nick-webster commented 4 years ago

Hi Sabrina, Managed to solve the issue. It was something to do with the .fa file used. Seems fine now and I ran pure clip Lon our eCLIP data. Another issue came up though when I tried to use the CL-motifs for bias correction. When I run compute_CLmotif_scores.sh with your motif files downloaded from GitHub, it loads the motifs in the file but crashes at the end.

nickwebster@Nicks-iMac-Pro Eclipse_results % compute_CLmotif_scores.sh mm10.fa Q0013_IP1_1_S15_L006_R1_001.adapterTrim.round2.rmRep.sorted.rmDup.sorted.bam motifs.xml motifs.txt fimo_clmotif_occurences.bed
Using temporary directory '/var/folders/v9/tzx_rfsn28v2znvg94csf24m0000gn/T/tmp.2sRVeCSP' Load reference file ... Load candidate sites from BED file ... Extract sequence for windows around candidate sites ... Run FIMO ... Using motif +TTTTNV of width 6. Computing q-values. Estimating pi_0 from a uniformly sampled set of 10000 p-values. Estimating pi_0. Estimated pi_0=0.961023 Using motif +TCTTTV of width 6. Computing q-values. Estimating pi_0 from a uniformly sampled set of 10000 p-values. Estimating pi_0. Estimated pi_0=0.958844 Using motif +TTTTWT of width 6. Computing q-values. Estimating pi_0 from a uniformly sampled set of 10000 p-values. Estimating pi_0. Estimated pi_0=0.943 Using motif +TBATCA of width 6. Computing q-values. Estimating pi_0 from a uniformly sampled set of 10000 p-values. Estimating pi_0. Estimated pi_0=0.984625 awk: can't open file BEGIN{FS=OFS="\t"} {print $2, NR}; source line number 1

I looked at the sh script and it seems to be at the final step:

replace motifs by its ID

get DREME motifs

awk 'BEGIN{FS="\t| "; OFS="\t"} $1 == "MOTIF" {print $1, $2};' "$MOTIFS_TXT" > "$TEMP_DIR/dreme_motifs.txt"

add id

awk -v n=$(wc -l < "$TEMP_DIR/dreme_motifs.txt") 'BEGIN{FS=OFS="\t"} {print $2, NR};' "$TEMP_DIR/dreme_motifs.txt" > "$TEMP_DIR/dreme_motifs.id.txt"

replace motif with its id

awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]=$2} NR>FNR{$4=a[$4];print};' "$TEMP_DIR/dreme_motifs.id.txt" "$TEMP_DIR/FIMO_CL_MOTIFS/fimo.origPos.sn.u.txt" > "$BED_OUT"

Any thoughts what might be causing it?

Thanks, Nick

nick-webster commented 4 years ago

Hi Sabrina,

I was running the compute_CLmotifs_scores.sh script on the mm10 genome but it would not complete. So I opened the script and have been running each step by hand. I am running meme 4.11.2. Fimo runs fine. Then run series of awk commands.

awk 'BEGIN{FS=OFS="\t"} $1 !~ "#pattern" && $6 > 0 {print $2, ($3-1), ($4-1), $1, $6, $5};' "$TEMP_DIR/FIMO_CL_MOTIFS/fimo.txt" > "$TEMP_DIR/FIMO_CL_MOTIFS/fimo.tmp.txt" runs OK

awk 'BEGIN{FS=""; OFS="\t"; ORS=""} {contig=$2; start=$(NF-2); end = $(NF-1); rem=$NF; print contig; for (i=3;i<=NF-3;i++){print "" $i} print "\t" start, end, rem"\n"};' "$TEMP_DIR/FIMO_CL_MOTIFS/fimo.tmp.txt" > "$TEMP_DIR/FIMO_CL_MOTIFS/fimo.tmp2.txt" this one does not run but gives the following error:

awk: trying to access out of range field -1 input record number 1, file fimo.tmp.txt source line number 1

Do you have an output file so I can see what it is trying to do? The fimo.txt file has format:

chr14 65975731-65975742(+) 0 5 TCTTTV 10.9867 +
chr16 52450637-52450648(-) 0 5 TCTTTV 10.9867 +
chr18 12972287-12972298(+) 1 6 TCTTTV 10.9867 +

It looks like you are trying to split the sequence info into separate start, end and strand columns but I don't see any FS="_" to be converted to tabs.

Thanks, Nick

skrakau commented 4 years ago

HI Nick, sorry for answering that late (I am not working on this topic anymore and first have to look again into this). I will try to do it tomorrow.

nick-webster commented 4 years ago

OK. No problem. I am installing a newer version of meme to see if that makes a difference.

Thanks, Nick


Nicholas Webster, Ph.D., M.A. Professor of Medicine Chief, Division of Endocrinology and Metabolism Associate Director for Shared Resources, Moores Cancer Center University of California, San Diego 9500 Gilman Drive La Jolla CA 92093 (858) 534-6275

On Jul 16, 2020, at 9:49 AM, Sabrina Krakau notifications@github.com<mailto:notifications@github.com> wrote:

HI Nick, sorry for answering that late (I am not working on this topic anymore and first have to look again into this). I will try to do it tomorrow.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/skrakau/PureCLIP/issues/18#issuecomment-659536472, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD3AGXAQD22YOKK2B7Y7CHTR34VPZANCNFSM4OJSECTA.

skrakau commented 4 years ago

I opened a new issue for the problem with the CL-motif score computation and will close this.