Removing short overlapping repeats

MaestSi commented 4 years ago

Hi, I am running NCRF for finding 4 different repeats in my Nanopore reads, and I have a summary file where all the repeats for the different motifs are listed. I noticed that some of the repeats are overlapped, therefore I would like to end up with a summary file that does not have any overlapping repeats. For this purpose I tried out ncrf_resolve_overlaps.py script, and I have that all un-overlapped alignments and overlapping groups are written to a file, as expected. However, this is not exactly what I would like to do, as I would like to keep also one alignment (the longest) among each overlapping group. Moreover, repeats in this file are not sorted anymore by read name and position. Do you think there is a way for me to obtain the desired output with NCRF, without having to manuallly delete unwanted repeats? Thanks in advance, Simone

rsharris commented 4 years ago

Could you point me at a small example that I can try out? If you aren't comfortable posting it here, mail me a link at ... rsharris at bx dot psu dot edu.

For clarification, when you wrote

I would like to keep also one alignment (the longest) among each overlapping group by "alignment" you mean the alignment summary (a line the output of ncrf_summary or ncrf_resolve_overlaps), not the full base-by-base alignment output by NCRF. Right?

By "keep", what I think you mean is as follows. If there's an alignment of motif1 that overlaps one of motif2, and the motif1 alignment was the longer, then that motif1 alignment should be output in the same file with all the non-overlapping motif1 alignments, and the motif2 alignment (plus any others in this overlap group) should be output to the overlaps file. Right?

As for sorting, which output isn't sorted? Without looking at the code, I'm not sure I made any special effort in ncrf_resolve_overlaps to create an output order. For example, the input motif summaries might have been sorted by decreasing score. I could output those in the same order they were input (this requires I save the whole file inside the script, essentially). But I don't see how I could do that with the overlapped items (because I don't know what order is wanted). I'm surprised I don't currently output those as positionally sorted, since internal to the script it has to be doing that.

Maybe the best solution for sorting is that I create ncrf_sort_summary to do the same sort (no pun intended) of things ncrf_sort does.

MaestSi commented 4 years ago

Hi, I just shared a small example dataset along with the commands I used.

by "alignment" you mean the alignment summary (a line the output of ncrf_summary or ncrf_resolve_overlaps), not the full base-by-base alignment output by NCRF. Right?

Yes, I mean the alignment summary.

By "keep", what I think you mean is as follows. If there's an alignment of motif1 that overlaps one of motif2, and the motif1 alignment was the longer, then that motif1 alignment should be output in the same file with all the non-overlapping motif1 alignments, and the motif2 alignment (plus any others in this overlap group) should be output to the overlaps file. Right?

Yes x2. Ideally, if the alignment of motif2 is not completely included in the alignment of motif1, I would also like to keep that small portion of motif2 in the non-overlapping alignments. I would also like to have 'N's in the summary file for regions were no repeats were found, but I don't know if this is something doable easily.

As for sorting, which output isn't sorted?

Sorry, I think I though it was not sorted because some repeats are always in the "overlapping alignments" section of the file, which is at the bottom. Moreover, I think I have not perfectly understood how the --out option works. Based on this:

The name template either names a single file or a collection of files. If if includes the substring "{motif}", this substring is replaced by a motif name and any un-overlapped alignments to that motif are written to that file. If the template name doesn't include "{motif}", all un-overlapped alignments and overlapping groups are written to one file.

Overlapping groups are either written to the console (if no name template is given), to the same file with alignments (if the name template doesn't contain "{motif}"), or to a a file separate from the alignments (with "{motif}" replaced by "overlaps").

it looks like I may use that option either to filter out alignments overlapping to one particular motif or to save overlapping groups to a separate file from the alignments, but I had no luck with that. Thanks in advance, Simone

rsharris commented 4 years ago

Thanks for the example, I will take a look at that.

Moreover, I think I have not perfectly understood how the --out option works If you do ncrf_resolve_overlaps xxx.ncrf.summary --out= xxx.{motif}.ncrf.summary you will get a separate summary output file for each motif, plus one for all the overlaps.

From the rest of your message, it sounds like that isn't what you would want.

I think I thought it was not sorted because some repeats are always in the "overlapping alignments" section of the file, which is at the bottom. Right. What it actually has is every overlap group with a blank line separating each group from the next. These are ordered by the number of items in each group. Ties are broken (I think) by position.

Neither output form is an the order you want.

I think I didn't consider worrying about output order there because I wasn't looking for the same thing you are. And a whitespace delimited table can be sorted using awk, like this: edited to change the sort for field 3 to dictionary

cat summary_file \
      | awk '/^#/  { print $0 }
             !/^#/ { print $0 | "env LC_ALL=C sort -k 3,3d -k 4,4n" }' \

Admittedly, that's not an ideal solution to expect users to know how to do (and it's klunky to have to tell awk which columns to sort on). But in my own use that was adequate.

Ideally, if the alignment of motif2 is not completely included in the alignment of motif1, I would also like to keep that small portion of motif2 in the non-overlapping alignments. I think this is why I didn't go any further down the path with overlaps. It gets messy trying to decide which alignments in an overlap group are the 'best'. Note that we might have a long series with A overlapping B, B overlapping C, C overlapping D (and perhaps more). So if we are to keep the best that might mean keeping A and C, or B and D, or A and D. And then carving the others into shrapnel, while doable, raises other questions (e.g. if the original alignments were subjected to a length limit, the shrapnel probably should be also, but this would require the user to give me that limit again).

I think considerations like those are why I decided to just separate out the overlaps from the non-overlaps, and pass the buck for figuring out what to do with the overlaps.

I would also like to have 'N's in the summary file for regions were no repeats were found, but I don't know if this is something doable easily.

This is interesting, because just two days ago I became aware that another user is apparently doing the same thing (see issue #6). Is there some downstream tool you're passing this output to?

MaestSi commented 4 years ago

Admittedly, that's not an ideal solution to expect users to know how to do (and it's klunky to have to tell awk which columns to sort on). But in my own use that was adequate.

Sorry, I have to admit it was not difficult to do, but I was just expecting the resolved summary file to keep somehow the ordering of the summary file obtained using ncrf_sort.py --sortby=name.

I think this is why I didn't go any further down the path with overlaps. It gets messy trying to decide which alignments in an overlap group are the 'best'. Note that we might have a long series with A overlapping B, B overlapping C, C overlapping D (and perhaps more). So if we are to keep the best that might mean keeping A and C, or B and D, or A and D. And then carving the others into shrapnel, while doable, raises other questions (e.g. if the original alignments were subjected to a length limit, the shrapnel probably should be also, but this would require the user to give me that limit again).

I understand the difficulties here. I think I may just go on manually deleting lines with shorter overlaps.

If you do ncrf_resolve_overlaps xxx.ncrf.summary --out= xxx.{motif}.ncrf.summary you will get a separate summary output file for each motif, plus one for all the overlaps.

Probably this is not something I need at the moment, but specifying, for example, CCTG_repeat (which is one of the motifs in the summary file) results in one single file with all the motifs and all the overlaps, I see no difference to redirecting the output of resolve_overlaps.py to one output file. Am I missing something?

Is there some downstream tool you're passing this output to?

No, there isn't. I am just inspecting the summary files with Excel, and I am interested in inserting Ns for better visualizing interruptions in the repetition. However, I understand this feature may not be appealing to many, and I think I can write some script on my own to do that. Thank you very much, Simone

rsharris commented 4 years ago

Probably this is not something I need at the moment, but specifying, for example, CCTG_repeat (which is one of the motifs in the summary file) results in one single file with all the motifs and all the overlaps, I see no difference to redirecting the output of resolve_overlaps.py to one output file. Am I missing something?

Probably non-obvious -- the filename should contain "{motif}" as a substring. Not one of the motifs, but the string "motif" inside two curly brackets. Then when non-overlapped motifs are written to a file, "{motif}" is replaced with the actual repeat. So in your case you'd have four files for non-overlaps, and one for the groups of repeats.

... but I was just expecting the resolved summary file to keep somehow the ordering of the summary file obtained using ncrf_sort.py --sortby=name.

A reasonable expectation. To really accomplish that ncrf_sort would have to put the sort option somewhere. Best place would be as some sort of comment in the sorted file. And this would have to get propagated into the summary file. Doable, but it would mess with the simplicity of the whitespace-with-headers-table format.

rsharris commented 4 years ago

I think I may just go on manually deleting lines with shorter overlaps.

Looking at the example you sent me (without posting a intimate details here), I see there are several overlap groups that include dozens of alignments. As well as some simpler cases.

Thinking about how to script something equivalent to what I think you are doing by deleting the shorter overlaps.

(1) Read and collect lines until a blank line is encountered. This is one overlap group.
(2) While the latest group is non-empty,
(2a) choose the longest alignment, and send it to output
(2b) delete all alignments that overlapped that one

It's greedy and may not give you the optimal covering of the interval. But this might be what you are doing manually anyway.

So I think the pipeline would be to run ncrf_resolve_overlaps with the option to output to several different files. Then run the above filtering script on the overlaps file, concatentate that filtered result with the individual motif files, and sort. Then pass it through another script to insert the Ns.

Thinking about the N insertion ...

(0) currentRead = ""
(1) Read line
(2) If the read name is different than currentRead
(2a) if (currentRead isn't "") and (currentEnd < currentReadLen)
(2aa)  output "N", currentRead, currentEnd, currentReadLen
(2b) currentRead = read name
(2c) currentReadLen = seqLen of this line
(2d) currentEnd = 0
(3) if (start > currentEnd)
(3a) output "N", currentRead, currentEnd, start
(4) output the latest line
(5) currentEnd = end
(Final) if (currentRead isn't "") and (currentEnd < currentReadLen)
(Final.a)  output "N", currentRead, currentEnd, currentReadLen

Actually, I'm going to go ahead and try the N stuff in awk, to see if I've missed anything.

rsharris commented 4 years ago

Actually, I'm going to go ahead and try the N stuff in awk, to see if I've missed anything.

Probably too long to be practical in awk, but here's what it boiled down to:

      | awk '/^#/  { print $0 }
             !/^#/ {
                   readName = $3;
                   if (readName != currentRead)
                     {
                     if ((currentRead != "") && (currentEnd < currentReadLen))
                       print "-","N",currentRead,currentEnd,currentReadLen,"+",currentReadLen;
                     currentRead = readName;
                     currentReadLen = $7;
                     currentEnd = 0;
                     }
                   start = $4;
                   end = $5;
                   if (start > currentEnd)
                     print "-","N",currentRead,currentEnd,start,"+",currentReadLen;
                   print $0;
                   currentEnd = end;
                   }
             END   {
                   if ((currentRead != "") && (currentEnd < currentReadLen))
                     print "-","N",currentRead,currentEnd,currentReadLen,"+",currentReadLen;
                   }'

MaestSi commented 4 years ago

Thanks for all the kind answers and explanations! I am following your suggestions for obtaining one single summary file sorted by read name and coordinate, that includes only the longest of overlapped alignments, and filled with Ns where there are no repeats detected. Simone

rsharris commented 4 years ago

I've updated the description of ncrf_resolve_overlaps to (hopefully) make it clearer what the outputs are.

MaestSi commented 4 years ago

Perfect, I think it is much clearer now. Thank you again, you can close the issue for me. Simone

aishsk87 commented 1 year ago

@MaestSi What is the length of your reads and did you have to convert the reads from fastq to fasta to run this program?

rsharris commented 1 year ago

@aishsk87 You do have to convert to fasta. I suppose that's an oversight on my part when I wrote it. But it's trivial to convert fastq to fasta on the fly using any of many different *nix commands. For example, cat example.fastq | sed -n '1~4,+1 { s/^@/>/ ; p }' | NCRF ... will convert the fastq file and feed it into NCRF.

MaestSi commented 1 year ago

Hi @aishsk87 , yes, I had to convert the reads from fastq to fasta. We analysed reads with lengths up to 50 kbp. You can find a wrapper to this wonderful tool here. Best, Simone

aishsk87 commented 1 year ago

Amazing, thanks!

makovalab-psu / NoiseCancellingRepeatFinder

Removing short overlapping repeats #7