pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License

Add example notebook #709

Closed mtsokol closed 1 month ago

mtsokol commented 1 month ago

Hi @willow-ahrens @hameerabbasi,

Here I add a short example notebook with benchmarks and plots for the latest released alpha version.

hameerabbasi commented 1 month ago

I'd love to see two things:

  1. Testing that the notebook can be executed (without checking the output). We can use e.g. nbmake for this (see the sketch after this list). This keeps the notebook from going obsolete, and should be done in this PR.
  2. We can, as a follow-up, include notebooks in the documentation. Maybe a task for the upcoming intern.
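
A minimal sketch of how such a check could be wired up with nbmake (a pytest plugin); the notebook path is a placeholder, not the actual file name in this PR:

```python
# Sketch: execute an example notebook via nbmake and fail on any cell error.
# Equivalent to running `pytest --nbmake examples/example.ipynb` in CI.
# The path "examples/example.ipynb" is an assumption for illustration.
import pytest

pytest.main(["--nbmake", "examples/example.ipynb"])
```
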
hameerabbasi commented 1 month ago

Does the notebook have large datasets/arrays or benchmarks? If so, we should remove those or convert them to be smaller for demonstration purposes, with deterministic data.

willow-ahrens commented 1 month ago

Can we run benchmarks on large data without storing it in the notebook?

willow-ahrens commented 1 month ago

Also, these notebooks look great, Mateusz! Thanks!

hameerabbasi commented 1 month ago

Can we run benchmarks on large data without storing it in the notebook?

There are several in the examples/ folder, yes. Maybe we need a way to read the dimension length from an external env var so the CI load is lower; a sketch follows.
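
One possible shape for that, as a sketch (the env var name SPARSE_EXAMPLE_SIZE is made up for illustration):

```python
import os

import sparse

# Hypothetical env var: CI exports a small value to keep the benchmark cheap,
# while local runs fall back to the full problem size.
SIZE = int(os.environ.get("SPARSE_EXAMPLE_SIZE", "1000"))
X = sparse.random((SIZE, SIZE), density=0.01)
```
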

hameerabbasi commented 1 month ago

Hmm. Seems like the notebook is flaky on CI. It failed in only one of the last two commits, with no changes.

willow-ahrens commented 1 month ago

It looks like a very solid failure, though: csr is not meant to be (1, 0) ordered.

hameerabbasi commented 1 month ago

It looks like a very solid failure, though: csr is not meant to be (1, 0) ordered.

Well, asarray should convert to the right format if necessary.

mtsokol commented 1 month ago

Ok, so we have this code snippet from the notebook that uses the Finch backend:

```python
X = sparse.random((100, 5), density=0.08)  # creates a random COO matrix
X = sparse.asarray(X, format="csc")  # converts to CSC format
X_X = sparse.permute_dims(X, (1, 0)) @ X  # for me locally it densifies, as the result is: SwizzleArray(Tensor(Dense{Int64}(Dense{Int64}(Element{0.0, Float64, Int64}...

X_X.get_order()  # it gives (1, 0) order, so I can only convert to CSR

X_X = sparse.asarray(X_X, format="csr")  # move back from dense to CSR format
```

So it looks like in the CI the result order of permute_dims(X, (1, 0)) @ X is sometimes (0, 1), as that's the only explanation for this failure.

willow-ahrens commented 1 month ago

Is this a finch version issue?

hameerabbasi commented 1 month ago

Ok, so we have this code snippet from the notebook that uses the Finch backend:

```python
X = sparse.random((100, 5), density=0.08)  # creates a random COO matrix
X = sparse.asarray(X, format="csc")  # converts to CSC format
X_X = sparse.permute_dims(X, (1, 0)) @ X  # for me locally it densifies, as the result is: SwizzleArray(Tensor(Dense{Int64}(Dense{Int64}(Element{0.0, Float64, Int64}...

X_X.get_order()  # it gives (1, 0) order, so I can only convert to CSR

X_X = sparse.asarray(X_X, format="csr")  # move back from dense to CSR format
```

So it looks like in the CI the result order of permute_dims(X, (1, 0)) @ X is sometimes (0, 1), as that's the only explanation for this failure.

While the nondeterminism here is definitely an issue, another issue is that asarray should convert the format if necessary, similar to the order= or dtype= kwargs.
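
For comparison, NumPy's asarray already coerces dtype and memory order on request; the format= kwarg discussed here would behave analogously (a sketch of the analogy, not of sparse's implementation):

```python
import numpy as np

a = np.asarray([[1, 2], [3, 4]], order="C")
b = np.asarray(a, dtype=np.float32, order="F")  # converts dtype and order as needed
assert b.dtype == np.float32 and b.flags["F_CONTIGUOUS"]
```
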

hameerabbasi commented 1 month ago

Is this a finch version issue?

We use the latest finch-tensor version, but I'm unsure if that's behind, as it pins Finch.jl.

hameerabbasi commented 1 month ago

@willow-ahrens Could it be that the output order depends on the data? sparse.random would produce different data/coordinates each time.
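
If the order really is data-dependent, seeding the generator would at least make the notebook reproducible; a sketch assuming the random_state kwarg of sparse.random:

```python
import sparse

# With a fixed seed the coordinates/data are identical on every run,
# so any data-dependent output order would also be stable.
X = sparse.random((100, 5), density=0.08, random_state=42)
```
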

willow-ahrens commented 1 month ago

It is possible to set the finch version we use with an environment variable. I am wondering whether Mateusz has an env var set locally.

willow-ahrens commented 1 month ago

Also, no, the output order and format should be reproducible. As far as I can tell, it always fails on CI so far?

hameerabbasi commented 1 month ago

@willow-ahrens It's nondeterministic as to whether it passes: compare the passing run, the failing run, and the diff between them.

willow-ahrens commented 1 month ago

Could the difference between these commits cause the problem?

hameerabbasi commented 1 month ago

Could the difference between these commits cause the problem?

Nope, it was just re-enabling a lint rule, and it doesn't affect running Python code.

willow-ahrens commented 1 month ago

I can't reproduce the nondeterminism locally (it always fails for me). I also get a sparse(sparse(...)) format; I really do think something may also be up with our finch versions.

mtsokol commented 1 month ago

Hmm... now it passed with csr... I couldn't reproduce it locally on macOS or on a remote Linux machine.

But I think once we merge https://github.com/willow-ahrens/finch-tensor/pull/75, we won't need to worry about it anymore, as any format will be convertible to csc/csr, etc.

willow-ahrens commented 1 month ago

I'll work towards reproducing this nondeterminism. Maybe there's a similar kernel that behaves differently. Could we start by running in verbose mode?

willow-ahrens commented 1 month ago

Would y'all be okay if I push a few test commits?

mtsokol commented 1 month ago

Would y’all be okay if I push a few test commits?

Sure!

hameerabbasi commented 1 month ago

Would y’all be okay if I push a few test commits?

I've just sent a collaborator invitation; if you accept it, you should be able to push to Mateusz's branch.

willow-ahrens commented 1 month ago

I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?

mtsokol commented 1 month ago

I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?

The issue was that you defined X as a lazy variable and used it in sparse.compute(...), but then also used it in:

```python
b_hat = (inverted @ sparse.permute_dims(X, (1, 0))) @ y
```

without calling lazy on inverted or y, or calling compute. I updated it by defining X_lazy separately. Now you can execute the cell to get the execution plan.

The tensordot error comes from the arguments mixing lazy and eager tensors here; a corrected sketch follows.
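
A minimal runnable sketch of the corrected pattern (shapes and densities are placeholders; sparse.lazy and sparse.compute follow the usage in this thread):

```python
import sparse

X = sparse.random((100, 5), density=0.08)
y = sparse.random((100,), density=0.5)
inverted = sparse.random((5, 5), density=1.0)

# Every operand is made lazy before the expression is built;
# mixing lazy and eager tensors is what triggered the tensordot error.
X_lazy = sparse.lazy(X)
y_lazy = sparse.lazy(y)
inverted_lazy = sparse.lazy(inverted)

b_hat = (inverted_lazy @ sparse.permute_dims(X_lazy, (1, 0))) @ y_lazy
result = sparse.compute(b_hat, verbose=True)  # prints the execution plan
```
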

In the CI, notebooks are tested by executing the whole notebook: if the notebook passes, it's a ✅ without any output; if there's an error, the stack trace is reported along with the cell that failed.

mtsokol commented 1 month ago

@hameerabbasi I'm not sure about using nbmake. Right now it fails with an internal nbmake error https://github.com/pydata/sparse/actions/runs/9661787940/job/26650246800?pr=709 after upgrading finch-tensor to the latest patch (a version that passes everywhere else). I tried to debug it but didn't find anything informative. Is there an nbmake alternative? Maybe we could merge this notebook without running it in the CI.

hameerabbasi commented 1 month ago

@hameerabbasi I'm not sure about using nbmake. Right now it fails with an internal nbmake error https://github.com/pydata/sparse/actions/runs/9661787940/job/26650246800?pr=709 after upgrading finch-tensor to the latest patch (a version that passes everywhere else). I tried to debug it but didn't find anything informative. Is there an nbmake alternative? Maybe we could merge this notebook without running it in the CI.

Yes let's do that for now. I'll look for alternatives.

mtsokol commented 1 month ago

👍 Let me squash git history and hide that CI job for now.

mtsokol commented 1 month ago

@hameerabbasi Ok, I figured out what the issue was. It was an out-of-memory error, and I commented out the last test configurations in MTTKRP (1000x1000x1000). Let me clean up the notebook and squash.

I think we can stay with nbmake, but it could have reported the OOM more explicitly.

mtsokol commented 1 month ago

@willow-ahrens You can force-pull the branch (I squashed the commits) and the notebook should run and print the execution plan in sparse.compute(..., verbose=True).

I released a new finch-tensor, so you also need pip install --upgrade finch-tensor in your env to convert freely between formats.

willow-ahrens commented 1 month ago

Wait, I'm not sure why the OOM on MTTKRP is related to the nondeterministic transposition order of X.

hameerabbasi commented 1 month ago

Wait, I'm not sure why the OOM on MTTKRP is related to the nondeterministic transposition order of X.

Would you prefer that be resolved in this PR? AFAICT, the code here has no issues related to that. We can, of course, file an issue to track it if you prefer.

willow-ahrens commented 1 month ago

I'm happy to resolve in another PR! I'd like to understand though whether we have an explanation for that behavior, and whether it is still observed here. Is it explained by the OOM, for example? Was it just the finch-tensor version all along? Perhaps I can try to reproduce the problem with a simple test in finch-tensor.

willow-ahrens commented 1 month ago

I was confused because it didn't seem like we found the original source of nondeterminism, and I wanted to check whether y'all felt like you had found it.

mtsokol commented 1 month ago

Wait, I'm not sure why the OOM on MTTKRP is related to the nondeterministic transposition order of X.

@willow-ahrens They're unrelated. There's an issue where X.T @ X returned a different result order locally on my machine than in the CI. This caused sparse.asarray(arr, format="csr") to fail in the CI, as arr had a different order than CSR required. I implemented a fix in finch-tensor that refines order handling and passes SwizzleArrays to copyto! when we change storage; therefore any format can now be changed to any other format, as copyto! accepts SwizzleArrays as both source and destination. But there's an issue where copyto!(dense_source, any_target) copies zeros too, as non-zero values, so for dense source formats I needed to add dropfills for this case, which creates another copy.

This fixed calling sparse.asarray(arr, format="csr") for any format of arr, but in MTTKRP we call sparse.asarray(dense, format="csf"), which for the largest dimension configuration caused an OOM, as dense was copied twice (in copyto! and dropfills). I disabled running the largest configurations in the CI (we only want to make sure the notebook generally runs) and now the job passes.
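
At the Python level, the upshot of the fix is that format round-trips work regardless of the source order; a hedged sketch:

```python
import sparse

# After the finch-tensor fix, asarray can convert between any formats,
# since copyto! now accepts SwizzleArrays on both sides.
coo = sparse.random((100, 5), density=0.08)
csc = sparse.asarray(coo, format="csc")
csr = sparse.asarray(csc, format="csr")  # previously could fail when the order didn't match
```
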

willow-ahrens commented 1 month ago

Feel free to merge, though; you're right that this can go through.

willow-ahrens commented 1 month ago

@mtsokol thanks! Yes, I think dropfills should probably support transposition like copyto! does. I'll add a PR.

mtsokol commented 1 month ago

Thank you! My two comments in https://github.com/willow-ahrens/Finch.jl/issues/609 describe these two issues.

willow-ahrens commented 1 month ago

Also, it's good to know that the nondeterminism was related to the Finch version. I'll add a more direct test for this to finch-tensor so we can see if it ever fails again nondeterministically in CI.
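
Such a test might look like the following sketch (get_order() is taken from earlier in this thread; everything else is an assumption, not finch-tensor's actual test):

```python
import sparse

def test_matmul_result_order_is_deterministic():
    # Build the expression from this thread repeatedly and check that
    # the result order never varies between runs.
    X = sparse.asarray(sparse.random((100, 5), density=0.08), format="csc")
    orders = {(sparse.permute_dims(X, (1, 0)) @ X).get_order() for _ in range(10)}
    assert len(orders) == 1
```
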

hameerabbasi commented 1 month ago

Merging, thanks for the follow-ups @willow-ahrens and @mtsokol!