vkostyukov / la4j

Linear Algebra for Java
http://la4j.org
Apache License 2.0
371 stars 109 forks source link

Can't use toBinary method on large matrices #282

Open HowardLander opened 7 years ago

HowardLander commented 7 years ago

The following code fails test class fails on a fairly small (but not sparse) matrix:

Here's the invocation:

/usr/java/jdk1.7.0_51/bin/java -Xmx8000m -cp .:./la4j-0.6.0.jar MatrixTestBug java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:330) at org.la4j.matrix.sparse.CRSMatrix.toBinary(CRSMatrix.java:1007) at MatrixTestBug.main(MatrixTestBug.java:13)

and here's the code (MatrixTestBug.java)

import java.io.; import org.la4j.;

public class MatrixTestBug {

public static void main(String[] args) {

 org.la4j.matrix.sparse.CRSMatrix theMatrix = new            org.la4j.matrix.sparse.CRSMatrix(15000,15000);
   theMatrix.setAll(.5);

   try {
       FileOutputStream fos = new FileOutputStream("test.tmp");
       byte [] theMatrixBytes = theMatrix.toBinary();
   } catch (Exception e) {
       e.printStackTrace();
   }
}

}

One possible work around (at least for me) for this would be to restore the writeExternal and readExternal methods to the matrix classes. I suspect I could also ignore this problem if the matrix classes were serializable, as the only reason I am doing this is to write the matrix (and some other data) to a file. Any ideas?

Thanks Howard

vkostyukov commented 7 years ago

As a workaround, you can traverse the matrix manually and write it into the file.

HowardLander commented 7 years ago

Thanks for the response.

I suppose I could do that, but then wouldn't I lose the "sparsity" of it. Well, I guess I could encode a value (say -1) for the "empty cells", but that seems a bit ugly. How difficult would it be for these classes to be serializable? Do you think the default writeObject() and readObject() methods would work?

Howard

vkostyukov commented 7 years ago

Why losing sparsity? You can convert sparse matrices toMatrixMarket (which is super compact) and write it into a file. As far as I understand the problem is when you convert dense matrices.

Also feel free to send a PR with the solution - I'm happy to merge that in.

vkostyukov commented 7 years ago

BTW, you're trying to allocate 2gb matrix on a 800mb heap.

HowardLander commented 7 years ago

Are you sure? Here's my command line:

/usr/java/jdk1.7.0_51/bin/java -Xmx8000m -cp .:./la4j-0.6.0.jar MatrixTestBug

Howard

On 9/28/16 4:22 PM, Vladimir Kostyukov wrote:

BTW, you're trying to allocate 2gb matrix on a 800mb heap.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250288240, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrbx1wN0jY0xprA0ckhAtPxXXsbvsks5qusyigaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

vkostyukov commented 7 years ago

Ahh I see.

Actually, this looks like an integer overflow. Your cardinality is already pretty close to Int.MaxValue (15000 * 15000) and when we're trying to allocatate the output array we multiply that by 8 (as per a single double value) and get the negative result.

One solution to this would be to use ByteBuffer instead of byte[] so we can concat them (and even possibly do off-heap).

HowardLander commented 7 years ago

Hi Vladimir

I'm was unaware of Matrix Market. It looks interesting. I'll definitely check it out. Thanks for mentioning it.

As far as the dense matrix goes, that was just my way of testing. We have some matrices that, even sparse, are bigger than max int bytes, so I was actually trying to break my code when I broke yours!

I was definitely considering trying to make these classes serializable myself, but I didn't know if there was some reason you hadn't done this. If you think it's a reasonable idea, I'd be happy to give it a try.

FYI: here's the code I was testing. I was wondering what would happen if the binary matrix was larger than max int.

    try {
        FileOutputStream fos = new FileOutputStream("test.tmp");
        byte [] theMatrixBytes = theMatrix.toBinary();
        System.out.println("matrix size: " + theMatrixBytes.length);
        OutputStream theStream = new BufferedOutputStream(fos);
        ObjectOutputStream oos = new ObjectOutputStream(theStream);
        oos.writeInt(theMatrixBytes.length);
        oos.write(theMatrixBytes);
        oos.writeObject(nameSpace);
        oos.writeObject(theList);
        oos.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Howard

On 9/28/16 4:21 PM, Vladimir Kostyukov wrote:

Why losing sparsity? You can convert sparse matrices |toMatrixMarket| (which is super compact) and write it into a file. As far as I understand the problem is when you convert dense matrices.

Also feel free to send a PR with the solution - I'm happy to merge that in.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250287795, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrXUJLmptYOJXbITZm_XyOQ51pRD7ks5qusw3gaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

vkostyukov commented 7 years ago

I think if you swap to a dense (from CRS) you should be fine in that test. It's cheaper to write 15000x15000 dense matrix than to write 15000x15000 sparse matrix (with .setAll) since the later involves an overhead for maintaining sparcity.

HowardLander commented 7 years ago

See below.

On 9/28/16 4:30 PM, Vladimir Kostyukov wrote:

Ahh I see.

Actually, this looks like an integer overflow. Your cardinality is already pretty close to Int.MaxValue (15000 * 15000) and when we're trying to allocatate the output array https://github.com/vkostyukov/la4j/blob/master/src/main/java/org/la4j/matrix/sparse/CRSMatrix.java#L1021 we multiply that by 8 (as per a single double value) and get the negative result.

One solution to this would be to use |ByteBuffer| instead of |byte[]| so we can concat them (and even possibly do off-heap).

Would we have to change the toBinary() code, or is there something I can change in mine?

Howard

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250290334, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrYyogahgjNaKgMSO9pr7oMV0-B7iks5qus5_gaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

vkostyukov commented 7 years ago

Matrices/Vectors are used to be serializeable. I decided to not do that by the same reason no one else should do that. Java's serialization is completely broken (both performance-wise and security-wise). You should look at Jackson instead.

vkostyukov commented 7 years ago

I'm afraid not, there is no easy fix to that.

HowardLander commented 7 years ago

I don't doubt what you are saying. What happens when I have a sparse matrix 1 million by 1 million? What I I really need a way to write large CRS matrices of unknown sparsity to file along with some other relevant information.

Are the security issues you mentioned in your other message the reason that the writeExternal and readExternal methods were removed? I'm not a security expert at all.

Howard

On 9/28/16 4:33 PM, Vladimir Kostyukov wrote:

I think if you swap to a dense (from CRS) you should be fine in that test. It's cheaper to write 15000x15000 dense matrix than to write 15000x15000 sparse matrix (with |.setAll|) since the later involves an overhead for maintaining sparcity.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250290960, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrdejjv6xL8uO6F0QlQ6DRbz1C-7Xks5qus8RgaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

HowardLander commented 7 years ago

Potentially uninformed question: Is Jackson better than Gson? I've used Gson.

On 9/28/16 4:35 PM, Vladimir Kostyukov wrote:

Matrices/Vectors are used to be serializeable. I decided to not do that by the same reason no one else should do that. Java's serialization is completely broken (both performance-wise and security-wise). You should look at Jackson instead.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250291515, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrVPFBwmgZ2OQ-bMwGBq4QQRab49aks5qus-YgaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

vkostyukov commented 7 years ago

Thre is a limit on a max sparse matrix size in la4j. It's MxN=Int.MaxValue. There is no way you can do 1m x 1m with la4j, just because there is a limit on array size on JVM (the maximum array you can allocate is Int.MaxValue, 2gb). If you need support for bigger matrices, I'd recommend either:

  1. Build you own thing with a sub-set of features
  2. Use something else (MTJ and Parallel Colt support CRS/CCS)

Looking at this comparison table (which I'm quite sceptical about): https://java-matrix.org/ you can try MTJ and Parallel Colt and see if their limits are true for sparse matrices (16gb and 64gb correspondingly).

HowardLander commented 7 years ago

Thanks for the info. I probably should have said by now that I really like la4j. It's easy to work with and has all of the features (except for the serialization issues we have been discussing) that I need. I don't know what motivates folks like you to provide these sorts of libraries but I'm sure glad you do!

The largest matrix that we have currently is about 18600 by 18600, but we do have plans for bigger. Square root of max int is about 46000, so I guess you are right that we will probably have to look at something else fairly soon. I had looked at colt before and there was something I didn't like about it, but that was several years ago.

I may just live with the limitation for now. I do have a workaround

Thanks

Howard

On 9/28/16 4:52 PM, Vladimir Kostyukov wrote:

Thre is a limit on a max sparse matrix size in la4j. It's |MxN=Int.MaxValue|. There is no way you can do 1m x 1m with la4j, just because there is a limit on array size on JVM (the maximum array you can allocate is Int.MaxValue, 2gb). If you need support for bigger matrices, I'd recommend either:

  1. Build you own thing with a sub-set of features
  2. Use something else (MTJ and Parallel Colt support CRS/CCS)

Looking at this comparison table (which I'm quite sceptical about): https://java-matrix.org/ you can try MTJ and Parallel Colt and see if their limits are true for sparse matrices (16gb and 64gb correspondingly).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vkostyukov/la4j/issues/282#issuecomment-250296047, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNGrcoTK9EBYmc_rJlDTad8rdkWGz2Bks5qutOIgaJpZM4KJPlZ.

Howard Lander mailto:howard@renci.org Senior Research Software Developer Renaissance Computing Institute (RENCI) http://www.renci.org The University of North Carolina at Chapel Hill Duke University North Carolina State University 100 Europa Drive Suite 540 Chapel Hill, NC 27517 919-445-9651

vkostyukov commented 7 years ago

Yeah, you should be fine with 18600x18600 matrices. I admit there is a bug in .toBinary (w.r.t to integer overflow) and if it's possible to allocate this matrix with la4j it should be possible to convert that into a binary form. Although, I don't have a good solution yet for this and you should feel free to suggest some.

In the meantime, I'd advice to shove your matrices into Jackson and see if it works (it should!). You don't need Java's standard serialization for that. Jackson (If you're ok with JSON) or Kryo (for binary format). These libraries will do the trick much better (and faster and safer).