iTaxoTools / TaxI2-legacy

Calculates genetic differences between DNA sequences
GNU General Public License v3.0

"Compare against reference database" also having memory issues #51

Closed · mvences closed this issue 3 years ago

mvences commented 3 years ago

The new function works well and I have been able to run up to 2000 sequences against a small reference database (50 sequences). However, with 5000 or 9000 sequences the program stops again with memory problems, as in the screenshot attached here:

[Screenshot: Capture_22March]

Also, if I run 1000 sequences against a larger reference database (500 seqs) I get similar memory problems.

So, what to do? At least in this function of the program, it would be important to be able to run input files of almost unlimited size (at least several tens of thousands of sequences). There will probably always be a limit on the maximum size of the reference database, but this is a secondary problem.

I see two solutions:

  1. Somehow I still believe there must be a way to get large files processed, because the general procedure should not be too different from that in DNAconvert: take one item out of one text file, do something with it (convert it, or compare it against other sequences and find the closest one), print one line of results into another text file, and then start on the next item, "forgetting" everything that has been done before except that one line of results. Even if the reference database were very large, say 10,000 sequences, each single step would only be a comparison of one query sequence against those 10,000 references. This will take a lot of time but should not be infeasible, and once each step is done, all memory and almost all temporary files that arose in the process can be cleared before the next query sequence is handled. Clearly this will be very (!) time-consuming with large data sets, especially if the reference database is large, but in principle memory problems should be avoidable (see the sketch after this list).

  2. There may be an alternative "turnaround option" that could be implemented for very large files, but I would only go for this if we really find no better way: partition the analysis into smaller portions. We see that 1000 query sequences are usually processed well against small databases, and 100 sequences against large reference databases. So maybe the program can (a) run the first 100 or 1000 sequences and produce an output file labelled "1-1000" or "1-100", (b) completely clear the memory and remove all temporary files except the output file(s) produced so far, and (c) tackle the next set of query sequences. In the end the user would have multiple output files (although a "merging" step could be added at the very end to produce one complete output file). This option is clearly not ideal, but it would make the program almost fully functional and usable in the short term.
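A minimal sketch of how option 1 could look, assuming Biopython's FASTA parsing and a hypothetical seq_distance(a, b) helper; none of this is the actual TaxI2 code:

from Bio import SeqIO  # Biopython's FASTA parser yields records lazily, one at a time

def closest_reference(query_path, reference_path, output_path, seq_distance):
    # Load the (comparatively small) reference database once.
    references = list(SeqIO.parse(reference_path, "fasta"))

    with open(output_path, "w") as out:
        out.write("query\tclosest_reference\tdistance\n")
        # Stream the queries: only one query record is held in memory at a time.
        for query in SeqIO.parse(query_path, "fasta"):
            best_id, best_dist = None, float("inf")
            for ref in references:
                dist = seq_distance(str(query.seq), str(ref.seq))
                if dist < best_dist:
                    best_id, best_dist = ref.id, dist
            # Write one result line, then "forget" everything about this query.
            out.write(f"{query.id}\t{best_id}\t{best_dist}\n")

Each iteration holds only one query record plus the reference list, so memory stays flat no matter how many query sequences there are.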

necrosovereign commented 3 years ago

I tried to remove all sources of memory leaks in #52. Can you try to check how fast the memory usage grows?

mvences commented 3 years ago

Currently some other kind of error is shown, probably not related to memory:

[Screenshot: Captureerror]

necrosovereign commented 3 years ago

I added the progress indicator, but forgot to initialize the variable that holds the time when processing started. It should be fixed in #53

mvences commented 3 years ago

Very frustrating ... the program continues to crash at about the same number of sequences (ca. 2100 with a small reference database of 50 sequences, ca. 900 with a larger database of 120 sequences). It seems to be very regular: after ca. 100,000 comparisons the memory overflow shows up.

The good thing is that this can now be traced, because the number of comparisons is printed; this is very good!

Let's see if there is a way of solving this. It seems to me the memory is still not "deleted" after each comparison, which is strange. Sangeeta says we may have to start by building a reference database with SQL, but somehow I am sure there is a way to avoid this.

[Screenshots: Capture_memoryerroragain, Capture_GUI]

StefanPatman commented 3 years ago

I was reading through our remaining issues and decided to have a look at this.

The culprit appears to be PairwiseAligner from Bio.Align. It created memory leaks even when I instantiated it anew for each comparison. Perhaps there is something in Biopython's documentation to help, but I suspect this is a bug in their code.

To work around this, I suggest using multiprocessing in make_distance_table in programstate.py. The sequences would be compared in "chunks", each on a new process pool. The size of the chunks could be chosen depending on the input size. When a pool of processes "dies", its memory is freed and the leaks are negated. In my tests, the memory leaks persisted in the workers between tasks, so the pool had to be reset to free memory.

As a bonus, these tasks could be assigned to the pool in parallel to make use of multiple CPU cores, which should provide a substantial speed-boost on modern computers. This would counteract the performance cost of starting new sub-processes. Some performance testing could go a long way to decide on the optimal chunk size and number of workers.

Proof-of-concept code I used in my tests:

from multiprocessing import Pool
...
def make_distance_table(table: pd.DataFrame, already_aligned: bool) -> pd.DataFrame:
...
    _len = len(table["sequence"])
    _chunk = 2  # how many target sequences to compare per chunk
    _times = _len // _chunk  # remainder sequences are ignored, but this was ok for testing
    # this is not as fast as numpy iteration, but ufuncs aren't optimal here anyway
    for i in range(_times):
        _from = i * _chunk
        _to = (i + 1) * _chunk
        for j in range(_from, _to):  # target
            for k in range(_len):  # query
                with Pool(2) as p:  # number of workers doesn't matter here since tasks are assigned synchronously
                    x = p.apply(seq_distances, (table["sequence"][j], table["sequence"][k]))  # could run this in parallel instead
                    print(x)  # do something with the result
...

I also looked into the sizes of the produced matrix. For a 50x2000 matrix with 100,000 "slots", each slot holding a tuple of four floats, we need about 100,000 x 4 x 32 bytes = 12.8 megabytes, which isn't much at all. For larger inputs, of course, this becomes a problem. Writing intermediary results to a file, as Miguel proposed above, seems like a great solution. It would also combine well with what I suggested, perhaps by writing the result of each chunk to a temporary file.
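One possible shape for that, sketched with pandas; the file name is illustrative and chunk_result stands for whatever DataFrame a chunk of comparisons produces:

import pandas as pd

def append_chunk(chunk_result: pd.DataFrame, path: str, first: bool) -> None:
    # Append each chunk's distances to a file on disk instead of keeping
    # the full matrix in memory; only write the header for the first chunk.
    chunk_result.to_csv(path, sep="\t", mode="w" if first else "a",
                        header=first, index=False)

# Once all chunks are done, the full table can be read back in one go,
# or streamed again if even that is too large:
# full_table = pd.read_csv("distances.tsv", sep="\t")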

I hope this was helpful, please let me know if there is something else I can do.

necrosovereign commented 3 years ago

@StefanPatman I tried using multiprocessing to implement a progress bar for morphometricanalyzer (https://github.com/iTaxoTools/morphometricanalyzer/issues/16), but it triggered an X server error. It seems that some interactions between tkinter and multiprocessing create threading issues.

StefanPatman commented 3 years ago

I wouldn't know what the problem there was; perhaps you were trying to update the GUI from a different process. I successfully ran the above code modification on Python 3.6.13 on Arch Linux without any X errors.

necrosovereign commented 3 years ago

@mvences I removed the global aligner object and made it so that a fresh aligner is created and destroyed for each alignment (pull request #54). It doesn't seem to significantly impact performance. I hope it will solve this memory issue.
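For reference, the "fresh aligner per alignment" pattern would look roughly like this; a sketch only, and the scoring parameters are placeholders, not necessarily those used in #54:

from Bio.Align import PairwiseAligner

def pairwise_score(seq_a: str, seq_b: str) -> float:
    # Create a fresh aligner for this single comparison...
    aligner = PairwiseAligner()
    aligner.mode = "global"
    aligner.open_gap_score = -10    # placeholder penalties
    aligner.extend_gap_score = -0.5
    score = aligner.score(seq_a, seq_b)
    # ...and let it go out of scope immediately afterwards.
    return score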

StefanPatman commented 3 years ago

As I wrote above: "It created memory leaks even when I instantiated it anew for each comparison."

Maybe I missed something, though; let's hope for the best.

mvences commented 3 years ago

Sh....

The program runs out of memory after exactly the same number of alignments, again!

If the suspicion really is that the memory leak comes from the Biopython aligner, maybe we need to use another library for the alignment? One option would be "lingpy" (http://lingpy.org/reference/lingpy.align.html). This is an unusual source for aligning molecular sequences, since it is built for linguistic analyses, but it is a dedicated alignment package. Maybe it would be possible to write a short piece of code that uses lingpy.align to perform a large number of iterative alignments and monitor whether there is memory leakage? That way we would avoid a large amount of work implementing a new library only to find the same error again. @Stefanos, would this be possible?
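A small harness along those lines could use the standard library's tracemalloc to watch memory across many repeated alignments; align_fn below is a stand-in for whichever aligner is being tested (lingpy, Biopython, or anything else):

import random
import tracemalloc

def leak_check(align_fn, iterations: int = 100_000, report_every: int = 10_000) -> None:
    # Repeatedly align two random DNA sequences and print the Python-level
    # memory footprint at regular intervals.
    tracemalloc.start()
    for i in range(1, iterations + 1):
        a = "".join(random.choice("ACGT") for _ in range(300))
        b = "".join(random.choice("ACGT") for _ in range(300))
        align_fn(a, b)
        if i % report_every == 0:
            current, peak = tracemalloc.get_traced_memory()
            print(f"{i} alignments: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    tracemalloc.stop()

Note that tracemalloc only sees Python-level allocations; if the leak sits inside a C extension, watching the process's resident memory (for example psutil.Process().memory_info().rss, or simply the OS task manager) would be more telling.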

Alternatively: what about using a simple aligner WITHOUT a dedicated library? There is code on GitHub that does such alignments without relying on Biopython or other libraries, but I am not sure how good and reliable it is. I am not sure this would help, anyway.

mvences commented 3 years ago

To see the bright side of it, at least we are all learning a lot about memory leakage and problems with large-scale analyses using Python code ....

StefanPatman commented 3 years ago

I'm quite confident that the biopython aligner is the problem here. It seems they had memory leak problems a year ago, and there are still discussions about a similar case from 2 months ago.

Switching to a different library could indeed fix this. I am not convinced lingpy is a good choice, though; it looks like it doesn't offer the extensive penalty/score calibration that biopython does. In addition, lingpy is written in pure Python, which is most likely slower than biopython's C extensions. Edit: there is actually some Cython in their source, so maybe it's not that bad after all.

I would argue against writing our own library for this. Maybe we could report this issue on biopython's GitHub so it can be fixed in the future. But if we want a 2-day solution, multiprocessing is the way to go.

@necrosovereign Should I create an experimental branch with a few changes? My code from above performs horribly speed-wise, but I could have something decent by tomorrow. Then we can check whether the problem is solved effectively for everyone and whether tkinter threading gets in the way.

mvences commented 3 years ago

Hello Stefanos,

if you could try an experimental branch, it would be fantastic!

In the long run, I think TaxI2 will require a major overhaul anyway. Now that I see it developing, I am sure there will be a very large community of researchers using it, and I have many ideas for additional functionality that could be added. So we will probably need to make more extensive use of C code, multiprocessing approaches, etc. to deal with the time-consuming tasks. But I would say this will then be "TaxI3", with additional functions, which could probably warrant a separate publication by itself.

For now, let's get TaxI2 up and running in the current design, but with the capacity to process large numbers of sequences, even if it remains slow. So, Stefanos, if you could make a suggestion for this, it would be great.

@necrosovereign, I hope you agree with this suggestion.

necrosovereign commented 3 years ago

I think opening up this experimental branch is a good idea. At the same time, I feel quite confident that I can write a Rust routine that replaces the aligner and performs without major flaws. I will not need much time to do this, but I am uncertain how convenient the integration of the Rust code into the Python program on Windows will turn out to be. So maybe Stefanos and I can work independently towards these two solutions (experimental branch and Rust code) and then compare the results?

mvences commented 3 years ago

@necrosovereign, this sounds like a very good idea. It will also help us understand, for future projects, whether Rust is a good solution for this kind of computational task. And trying two alternative options for a "short-term patch" makes a lot of sense.

StefanPatman commented 3 years ago

I just pushed the branch feature-multiprocessing, which implements a simple approach. Each chunk compares 10 lines of the query file against the reference database. Each process executes only one chunk and then frees its memory.
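One way to get that "one chunk per process, then release everything" behaviour with the standard library is a pool created with maxtasksperchild=1; this is only a sketch of the idea, not the actual branch code, and compare_chunk stands in for the per-chunk worker function:

from multiprocessing import Pool

def compare_all(chunks, compare_chunk, workers: int = 4):
    # maxtasksperchild=1 makes each worker process exit after a single chunk,
    # so any memory leaked by the aligner is released along with the process.
    with Pool(processes=workers, maxtasksperchild=1) as pool:
        # Chunks are distributed across the available workers in parallel.
        return list(pool.imap(compare_chunk, chunks))

The trade-off is the cost of spawning a fresh process for every chunk, which is why the chunk size matters.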

This runs almost as fast as before on one CPU core while avoiding the aligner memory leaks. When more processors are available, it will use all of them in parallel. The speedup factor isn't very impressive, but it's definitely there (3.5 times faster with 4 cores). This could possibly be improved by cutting bigger chunks, but I went for a safe approach since the memory leaks are inconsistent.

I did not yet implement the part where large input files are written to disk to save memory. Right now the program should be able to handle about 7,000,000 comparisons per gigabyte of memory, so an all-against-all comparison of the 2,000-sequence sample file should work fine, but for the 9,000-sequence input more work is required. It might be better if Vladimir did this, since it would require changing more than just one function.

It would be great if you could pull the new branch and confirm it works for you. I did not test this extensively, although I successfully compiled it under Windows 7 (with slight modifications to include the data folder).

necrosovereign commented 3 years ago

The feature-multiprocessing branch works quite well. The memory usage stays bounded, and I observed a speedup of about 2.3 (with 4 virtual cores).

necrosovereign commented 3 years ago

@mvences I implemented the Needleman-Wunsch algorithm in Rust. In my tests this algorithm gave the same alignments as the Biopython aligner. The sequences used in the tests were randomly generated and about 8 nucleotides long.
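For reference, the textbook Needleman-Wunsch recurrence the routine implements, sketched here in Python with a simple linear gap penalty; the scoring values are only examples, and the Rust code itself is not shown:

def needleman_wunsch_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    # dp[i][j] = best score for aligning the prefix a[:i] with the prefix b[:j]
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a[:i] aligned against an empty prefix of b
    for j in range(1, cols):
        dp[0][j] = j * gap  # b[:j] aligned against an empty prefix of a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,                 # match or mismatch
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[-1][-1]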

mvences commented 3 years ago

OK, for now I have been testing the version on the @StefanPatman branch. This has clearly improved the performance. I had a crash in the reference database comparison of 9000 sequences against a reference set of 50, but maybe this was caused by some error in the sequence file (there was no memory error notice). Another test with a 120-sequence database and 5000 query sequences worked just fine. So the performance has clearly been much improved, and this version of TaxI will be usable for normal data sets.

My suggestion would be the following:

StefanPatman commented 3 years ago

I checked, and even though the 50x9000 task completed for me, it did strain the memory. I pushed a slightly smarter solution where the chunks are split depending on a maximum number of comparisons per process. This is still not perfect, as I suspect that number changes depending on sequence lengths. I also fixed a bug with unequal matrix splits.

I'm not familiar with Rust but Vladimir's approach should provide a better permanent solution. Then we could reliably decide how much memory is needed, assigning chunks and writing to disk accordingly.

StefanPatman commented 3 years ago

Hello @necrosovereign, could you please merge these changes when you have some time? I previously used the wrong numpy function for array splitting, which resulted in a crash when Miguel tried a different dataset.
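For the record, the numpy distinction presumably at play here (the exact call in the branch isn't shown, so this is an assumption): numpy.split requires an equal division and raises an error otherwise, while numpy.array_split tolerates a remainder.

import numpy as np

rows = np.arange(10)
np.array_split(rows, 3)  # fine: sub-arrays of length 4, 3 and 3
np.split(rows, 3)        # ValueError: array split does not result in an equal division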