iTaxoTools / TaxI2-legacy

Calculates genetic differences between DNA sequences
GNU General Public License v3.0

IMPORTANT: Need to be able to run large files in Taxi2 - add progress information and option for large files - automatic output #39

Closed by mvences 3 years ago

mvences commented 3 years ago

However, TaxI2 is one of the tools that potentially needs to process very large input files, and we need to find a way to deal with this. I am currently running it with large input files, and for a file of ca. 8000 sequences it crashes. With 1000 sequences it also takes extremely long (not unexpectedly, since the number of pairwise alignments grows so quickly) and terminates with an error. A set of 200 sequences also gives an error message, while 120 sequences work. I attach below a ZIP file with the test files of various sizes that I used.

I can imagine this happens because calculating so many pairwise distances exhausts the available memory, or because many sequences are identical to each other; the latter is a situation that can very well occur in real data sets.

So, there are two things that would be nice to add:

  1. Some kind of progress information for the user. I remember you said that adding a progress bar to the GUI was not working properly, and indeed the same was true when Sangeeta tried something similar. So a visual progress bar is not a good idea. But maybe an option could be added that echoes the commands and operations being performed; with many sequences, for instance, showing which sequence is currently being aligned against all others to calculate the distances. In one of Stefanos' tools (pyr8s) this information flows in the "Preview" window while the program is running, so no new window or box is necessary. This would be very helpful because the user can see whether something is still happening or whether the program has already crashed.

  2. Add an option for large files where temporary data is not kept in memory but written directly to the hard drive. Either this becomes the default (if easy to implement) or, if not, one additional checkbox. Maybe the output files (especially alignment information and so on) could be written to disk successively, and if these temporary files exceed a certain size, some of the more "problematic" output is simply skipped? In any case, I imagine that something like the table with pairwise distances among all samples, one pair per row (not as a matrix), should be quite easy to write even with VERY large input files, similar to what fastmerge or DNAconvert are doing. And from this, producing the graphs and some of the summary statistics should be easy as well. Maybe in this case a preview of the files is not possible and can be skipped? But we need to make the program workable with at least 1000 sequences, better 10,000, even if it takes a couple of days to complete the job and can only perform part of it.

Taxi2test.zip
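The row-by-row idea in point (2) can be sketched as follows: each pairwise distance is appended to a file on disk as soon as it is computed, so no full matrix ever lives in memory. This is a minimal illustration only; the function names, the column headers, and the simple p-distance are assumptions, not TaxI2's actual API.

```python
import csv
from itertools import combinations

def p_distance(a, b):
    """Uncorrected p-distance over positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def write_pairwise_distances(records, out_path):
    """records: list of (seqid, sequence) tuples; writes one output row per pair."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["seqid1", "seqid2", "uncorrected_distance"])
        # Only the current pair is held in memory; each row goes straight
        # to disk and is never collected into an n-by-n matrix.
        for (id1, s1), (id2, s2) in combinations(records, 2):
            writer.writerow([id1, id2, f"{p_distance(s1, s2):.6f}"])
```

Because the output grows linearly (one line per pair), this pattern stays workable even when a full distance matrix would not fit in memory.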

necrosovereign commented 3 years ago

I added a progress display with #40. But showing which sequence is being aligned would make the distance calculation much slower, so I didn't add it.

The temporary files are not being held in memory, but are stored in a temporary folder.

Unfortunately, I don't know much about optimizing Python. But I don't think the distance calculation (which seems to take the most time) can be optimized without using another language.

mvences commented 3 years ago

OK, so I have now tried the new version. With big files it does not print anything in the "Preview" window for a long time, which probably reflects that the most time-consuming step is the pairwise alignment, and the program then crashes with a memory error.

Here are some suggestions:

  1. To better understand the progress of the pairwise alignments, maybe there is a way to track it simply by counting completed alignments. This would involve first counting the total number of sequences in the file and calculating the number of pairwise comparisons to be done; say there are 100000 pairwise comparisons, then print something like "1000 out of 100000 pairwise alignments done", "2000 out of 100000 pairwise alignments done", and so forth.

  2. Then, with a set of 200 sequences, the program crashes with the error pasted below as a screenshot. The last line reports a memory error with an array. To me this reads like a problem in computing these very large pairwise matrices. Maybe as a first preliminary step, could you temporarily disable the calculation of all the big "triangular" matrices, such as "pairwise_uncorrected_distance_between_sequences.txt"? If we are really lucky, this will already solve the problem and make the program run more stably (even if it would take a very long time with large input files).

  3. However, I am afraid this will not be enough. So I think it may be necessary to change the procedure in TaxI2 as follows: first, the program does an initial count of sequences, as suggested in point (1) above, and based on this applies the following rule:

    • If there are 100 sequences or fewer, perform the calculations as currently.
    • If there are more than 100 sequences, the program takes an entirely different route: it simply performs all pairwise alignments and comparisons and writes the results successively into the file "Summary_statistics.txt". This should be a linear process, as in DNAconvert with large fastq files, where one line is appended after another even if the file gets really large. Once this is completed, the other calculations are made by analyzing and summarizing the data stored in "Summary_statistics.txt" (but all of the "triangular matrices" are skipped entirely and not calculated, because the arrays would probably again take too much memory or violate some size limit).
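The counting scheme in point (1) above can be sketched like this: for n sequences there are n·(n−1)/2 unordered pairs, and progress is echoed every so many alignments. This is an illustrative sketch only; `align_pair` is a placeholder for the real alignment call, not a TaxI2 function.

```python
def report_progress(n_sequences, align_pair, report_every=1000):
    """Align every pair of sequences, echoing progress periodically.
    align_pair(i, j) stands in for the real pairwise-alignment call."""
    total = n_sequences * (n_sequences - 1) // 2  # number of unordered pairs
    done = 0
    for i in range(n_sequences):
        for j in range(i + 1, n_sequences):
            align_pair(i, j)
            done += 1
            if done % report_every == 0 or done == total:
                print(f"{done} out of {total} pairwise alignments done")
```

Since the total is known up front, the echoed lines let the user distinguish a slow run from a hung one without any GUI progress bar.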

Capture_Taxi2crash

necrosovereign commented 3 years ago

I don't know what is going on here. Allocating 30 MiB should not fail if there is enough memory left on the computer. And filling memory with such tables would require creating hundreds of them, far more than the program creates explicitly.

It seems the memory usage starts to grow exponentially during the creation of the last table (for the file "Summary_statistics.txt"), which is not one of the "triangular" tables. Skipping this table would also require skipping the graphics output.

mvences commented 3 years ago

Hmm. Not sure what to think. Maybe you can try running the files I provided in the ZIP file above on your Linux system and see if they cause the same issues? I can run up to 120 sequences; from 200 upwards it always stops or crashes.

Maybe it will be necessary to reconsider the way the tables are built; see my point (3) above. Rather than first calculating all pairwise distances and then building the table, could the final results be written into the table "Summary_statistics.txt" one by one, after each pairwise alignment is finished? Or is this already done?

necrosovereign commented 3 years ago

I've run Taxi2test1_ca200.tab several times. Its memory usage rises gradually from ~150 MB to ~650 MB and then rapidly fills almost the whole memory, and I have to kill the process. I think "MemoryError" doesn't happen on Linux because the OS automatically kills a process when there is no memory left.

mvences commented 3 years ago

So I think it will then really be important to understand which part of the process is filling up the memory. I imagine it is all of the many pairwise alignments?

necrosovereign commented 3 years ago

It seems that it happens after all the alignments are performed. I think some transformation of the table of distances is doing something unexpected.

necrosovereign commented 3 years ago

I have an idea for improving the processing algorithm that might reduce memory usage. But it would require an almost complete rewrite, which means that other issues cannot be addressed until it is complete.

mvences commented 3 years ago

This sounds very promising. Yes, go ahead - no worries about the other issues for now, we need to somehow solve the memory problem. All the other issues are very minor and can also remain unsolved for the time being!

necrosovereign commented 3 years ago

I think I managed to obtain some definite conclusions about memory usage.

It seems that for 200 sequences the distance calculation itself requires about 600 MB of memory, and this should grow quadratically with the number of sequences. So a computer with 10 GB of memory would be able to calculate distances for at most about 800 sequences. I think that 800 sequences (on a 10 GB machine) is the best I can achieve using only Python.
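As a quick sanity check on that estimate: if memory grows quadratically, then memory(n) ≈ memory(n_ref) · (n/n_ref)², so the largest feasible n is n_ref · sqrt(mem_available / mem_ref). A back-of-the-envelope sketch using the 600 MB / 200 sequences reference measurement from this comment:

```python
def max_sequences(mem_available_mb, mem_ref_mb=600, n_ref=200):
    """Largest sequence count whose quadratically scaled memory need
    still fits in mem_available_mb, given the reference measurement."""
    return int(n_ref * (mem_available_mb / mem_ref_mb) ** 0.5)

print(max_sequences(10 * 1024))  # roughly 800 for a 10 GB machine
```

This matches the ~800-sequence ceiling quoted above for a 10 GB computer.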

mvences commented 3 years ago

That would be very good. Yes, you are right: the number of combinations, and thus the memory use, rises quadratically, not exponentially.

From my experience with other programs, I would have expected the limit to be somewhere between 400 and 2000 sequences. If you can get it to work up to 800, great; but we could also limit the number to 500 or 600 sequences, and if the user inputs more than that, the program issues a warning that more sequences than allowed have been included.

necrosovereign commented 3 years ago

With #41 the program runs much better. But I feel I can improve the distance-calculating function by using Rust: there would be a small DLL from which Python can import a function calculate_distances.
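A sketch of what the Python side of such a bridge could look like. The library name and call signature below are assumptions, so the native load is shown only in comments; for illustration, a pure-Python stand-in with the same contract replaces the native call, letting the snippet run without the compiled library.

```python
import ctypes  # would be used to load the compiled Rust library

# Hypothetical load of the Rust cdylib (name and signature are assumptions):
# lib = ctypes.CDLL("./taxi2_distances.dll")
# lib.calculate_distances.argtypes = [...]
# lib.calculate_distances.restype = ctypes.POINTER(ctypes.c_double)

def calculate_distances(seqs):
    """Pure-Python stand-in with the contract the native function might have:
    returns the condensed upper triangle of pairwise uncorrected distances."""
    out = []
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            a, b = seqs[i], seqs[j]
            out.append(sum(x != y for x, y in zip(a, b)) / min(len(a), len(b)))
    return out

print(calculate_distances(["ACGT", "ACGA", "ACGT"]))  # -> [0.25, 0.0, 0.25]
```

Keeping the Python-visible contract this simple (sequences in, a flat list of distances out) would let the Rust implementation be swapped in later without changing the calling code.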