hgb-bin-proteomics / CandidateSearch

Proof-of-concept implementation of a search engine that uses sparse matrix multiplication to identify the best peptide candidates for a given mass spectrum.
https://hgb-bin-proteomics.github.io/CandidateSearch
MIT License

Low core usage for big matrices (> 29 000 000 rows) #25

Closed: michabirklbauer closed this issue 7 months ago

michabirklbauer commented 9 months ago

On the CPU cluster, core usage drops to 5-10% when doing a proteome-wide peptidoform search with 29 000 000 candidates. No clue why this happens.

Should test the C++ native implementation for matrices of this size to rule out possible errors coming from the C# marshalling side. If that behaviour is also seen on the pure C++ side -> ask in the Eigen Discord! Maybe it's an Eigen (or C++) related thing?

see here -> https://github.com/hgb-bin-proteomics/NC_Annika_prototypes/blob/master/C%2B%2BEigen/SpMV/Prototyping/main.cpp

A possible solution might be to split the candidates into chunks of 5 000 000 or 10 000 000 and do a batched search. This might be a good idea with very big databases anyway, since it yields top_n * batches results instead of just top_n results independent of database size.
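
A minimal C++/Eigen sketch of what that batched approach could look like; the `buildChunkMatrix` helper is hypothetical and only stands in for the actual candidate-matrix construction, this is not the CandidateSearch API:

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// assumed to exist elsewhere: builds the sparse candidate matrix for the
// candidate range [begin, end) (hypothetical helper, not part of CandidateSearch)
Eigen::SparseMatrix<float, Eigen::RowMajor> buildChunkMatrix(std::size_t begin, std::size_t end);

std::vector<std::pair<float, std::size_t>> searchBatched(const Eigen::VectorXf& spectrum,
                                                         std::size_t nCandidates, // e.g. 29 000 000
                                                         std::size_t chunkSize,   // e.g. 5 000 000
                                                         std::size_t topN)
{
    std::vector<std::pair<float, std::size_t>> results; // (score, global candidate index)

    for (std::size_t begin = 0; begin < nCandidates; begin += chunkSize)
    {
        const std::size_t end = std::min(begin + chunkSize, nCandidates);
        const Eigen::SparseMatrix<float, Eigen::RowMajor> chunk = buildChunkMatrix(begin, end);

        // SpMV: one score per candidate row of this chunk
        const Eigen::VectorXf scores = chunk * spectrum;

        // keep the top_n scores of this chunk, remembering global candidate indices
        std::vector<std::pair<float, std::size_t>> local;
        local.reserve(static_cast<std::size_t>(scores.size()));
        for (Eigen::Index i = 0; i < scores.size(); ++i)
            local.emplace_back(scores[i], begin + static_cast<std::size_t>(i));

        const std::size_t keep = std::min(topN, local.size());
        std::partial_sort(local.begin(), local.begin() + keep, local.end(),
                          [](const std::pair<float, std::size_t>& a,
                             const std::pair<float, std::size_t>& b) { return a.first > b.first; });
        local.resize(keep);
        results.insert(results.end(), local.begin(), local.end());
    }

    return results; // top_n * number_of_batches results in total
}
```

Each chunk matrix stays small enough to build and multiply independently, which is where the top_n * batches result count comes from.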

michabirklbauer commented 9 months ago

Investigated this; it seems like it never even reaches the multiplication part because creating the big matrix takes too long. Might need to ask the Eigen community if there is a better way than doing inserts/triplets. Technically I can build the CSR / CSC representation myself if necessary.
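
For reference, the "build the CSR representation myself" fallback could be wrapped with `Eigen::Map` so the multiplication code stays unchanged. A minimal sketch with made-up toy data (not the CandidateSearch code); the mapped arrays must outlive the map:

```cpp
#include <Eigen/Dense>
#include <Eigen/Sparse>
#include <vector>

int main()
{
    const int rows = 4, cols = 5;

    // hand-built CSR arrays for a small toy matrix (one nonzero per row)
    std::vector<int>   outerIndex = {0, 1, 2, 3, 4};     // rows + 1 entries
    std::vector<int>   innerIndex = {0, 2, 1, 4};        // column index per nonzero
    std::vector<float> values     = {1.f, 2.f, 3.f, 4.f};
    const int nnz = static_cast<int>(values.size());

    // zero-copy view of the CSR arrays as an Eigen sparse matrix
    Eigen::Map<const Eigen::SparseMatrix<float, Eigen::RowMajor>> mat(
        rows, cols, nnz, outerIndex.data(), innerIndex.data(), values.data());

    // SpMV works on the mapped matrix just like on a regular SparseMatrix
    Eigen::VectorXf v = Eigen::VectorXf::Ones(cols);
    Eigen::VectorXf result = mat * v;
    return 0;
}
```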

michabirklbauer commented 9 months ago

-> see here for test implementations: https://github.com/hgb-bin-proteomics/NC_Annika_prototypes/blob/master/C%2B%2BEigen/SpMV/Prototyping/high_dim.cpp

michabirklbauer commented 8 months ago

Tried several different approaches to create these big matrices, but without success. The following is what I think happens: basically this is a hardware/OS/C++ limitation of simply not being able to allocate a big enough contiguous chunk of memory (even if there is enough total memory available). With virtual address space on a 64-bit system this shouldn't be a problem, but it is: C++ throws a bad_alloc exception when trying to allocate the memory for a matrix of that size.

Filling up the matrix by inserts works, but is so slow that it is not an option, probably because the memory can't be pre-allocated and a re-allocation therefore occurs on every insert. Filling the matrix by triplets always crashes with a bad_alloc exception, whether the triplets are unsorted (stable Eigen) or sorted (main branch Eigen). A suggestion by the Eigen community was to create a forward iterator for the triplets instead of storing them in an std::vector, to reduce memory overhead (sketched below). However, I doubt that would solve the problem, as the exception is thrown during matrix creation and not when creating the std::vector.

Moving forward, I will not implement the matrix subsplitting since we won't need it in MS Annika. Instead I will:
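
For reference, the forward-iterator suggestion mentioned above could look roughly like this with stable Eigen 3.4, whose `setFromTriplets` walks the iterator range in multiple passes. The generator and its one-nonzero-per-row pattern are made up for illustration only; as noted, this removes the std::vector overhead but would not avoid a bad_alloc thrown while allocating the matrix itself:

```cpp
#include <Eigen/Sparse>

// Hypothetical on-the-fly triplet generator: candidate row i gets a single
// nonzero at column (i % cols); a real generator would emit the theoretical
// peaks of candidate i instead.
class TripletIterator
{
public:
    TripletIterator(int index, int cols) : index_(index), cols_(cols) {}

    // interface expected by setFromTriplets for each dereferenced element
    int row() const { return index_; }
    int col() const { return index_ % cols_; }
    float value() const { return 1.0f; }

    // minimal forward-iterator interface (the iterator is its own value type)
    const TripletIterator& operator*() const { return *this; }
    const TripletIterator* operator->() const { return this; }
    TripletIterator& operator++() { ++index_; return *this; }
    bool operator!=(const TripletIterator& other) const { return index_ != other.index_; }

private:
    int index_;
    int cols_;
};

Eigen::SparseMatrix<float, Eigen::RowMajor> buildWithoutTripletVector(int rows, int cols)
{
    Eigen::SparseMatrix<float, Eigen::RowMajor> mat(rows, cols);
    // setFromTriplets iterates the range several times (counting, filling,
    // sorting), so the generator must be cheap to copy and restart
    mat.setFromTriplets(TripletIterator(0, cols), TripletIterator(rows, cols));
    return mat;
}
```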

michabirklbauer commented 8 months ago

Another reason could be that the array indices overflow, e.g. if the number of elements in the value array of the CSR representation exceeds INT32_MAX.
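
To illustrate that concern (generic Eigen usage with a made-up density of 100 nonzeros per candidate, not measured numbers): the default `StorageIndex` of `Eigen::SparseMatrix` is a 32-bit `int`, so the index arrays are limited to INT32_MAX elements; a 64-bit `StorageIndex` lifts that limit at the cost of larger index arrays.

```cpp
#include <Eigen/Sparse>
#include <cstdint>
#include <iostream>
#include <limits>

// default StorageIndex is int (32 bit): the total nonzero count must stay below INT32_MAX
using SparseMatrix32 = Eigen::SparseMatrix<float, Eigen::RowMajor>;

// explicit 64-bit StorageIndex for matrices whose nonzero count may exceed INT32_MAX
using SparseMatrix64 = Eigen::SparseMatrix<float, Eigen::RowMajor, std::int64_t>;

int main()
{
    // back-of-the-envelope check with a made-up density of 100 nonzeros per candidate
    const std::int64_t candidates = 29000000;
    const std::int64_t nonzerosPerCandidate = 100;
    const std::int64_t totalNonzeros = candidates * nonzerosPerCandidate; // 2.9 billion

    std::cout << "total nonzeros: " << totalNonzeros << "\n"
              << "INT32_MAX:      " << std::numeric_limits<std::int32_t>::max() << "\n"
              << "overflow?       " << (totalNonzeros > std::numeric_limits<std::int32_t>::max()) << "\n";
    return 0;
}
```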