Closed michabirklbauer closed 7 months ago
Investigated this; it seems it never even reaches the multiplication part because creating the big matrix takes too long. Might need to ask the Eigen community whether there is a better way than doing inserts/triplets. Technically I can build the CSR/CSC representation myself if necessary.
-> see here for test implementations: https://github.com/hgb-bin-proteomics/NC_Annika_prototypes/blob/master/C%2B%2BEigen/SpMV/Prototyping/high_dim.cpp
Tried several different approaches to create these big matrices, but without success. Here is what I think happens: this is basically a hardware/OS/C++ limitation of simply not being able to allocate a big enough contiguous chunk of memory (even if there is enough total memory available). With virtual address space on a 64-bit system this shouldn't be a problem, but it is: C++ throws a bad allocation exception when trying to allocate the memory for a matrix of that size.

Filling the matrix by inserts works but is so slow that it is not an option; this is probably because the memory can't be pre-allocated, so re-allocation occurs on every insert. Filling the matrix by triplets, whether unsorted (stable Eigen) or sorted (main-branch Eigen), always crashes with a bad allocation exception. A suggestion from the Eigen community was to create a forward iterator over the triplets instead of storing them in an std::vector, to reduce memory overhead. However, I doubt that would solve the problem, since the exception is thrown during matrix creation and not when creating the std::vector.

Moving forward, I will not implement the matrix sub-splitting since we won't need it in MS Annika. Instead I will:
Another reason could be that the array indices overflow, e.g. the number of elements in the CSR value array exceeds INT32_MAX.
On the CPU cluster the core usage drops to 5-10% when doing a proteome-wide peptidoform search with 29 000 000 candidates. No clue why this happens.
Should test the C++ native implementation for matrices of this size, to eliminate possible errors coming from the C# marshalling side. If that behaviour also shows up on the C++ side alone -> ask in the Eigen Discord! Maybe it's an Eigen (or C++) related thing?
see here -> https://github.com/hgb-bin-proteomics/NC_Annika_prototypes/blob/master/C%2B%2BEigen/SpMV/Prototyping/main.cpp
A possible solution might be splitting candidates into chunks of 5 000 000 or 10 000 000 and doing a batched approach; this might be a good idea anyway with very big databases, since it yields top_n * batches results instead of just top_n independent of database size.