Closed: mtsokol closed this 1 month ago
I'd love to see two things:

1. `nbmake` for this, so the notebook doesn't go obsolete; this should be done in this PR.
2. Does the notebook have large datasets/arrays or benchmarks? If so, we should remove those or convert them to be smaller for demonstration purposes, with deterministic data.

Can we run benchmarks on large data without storing it in the notebook?
Also, these notebooks look great, Mateusz! Thanks!
> Can we run benchmarks on large data without storing it in the notebook?

There are several in the `examples/` folder, yes. Maybe we need a way to read the dimension length from an external env var so the CI load is lower.
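A minimal sketch of that idea, assuming a hypothetical env var name `SPARSE_DEMO_DIM` (not an existing setting):

```python
import os

# Read the demo dimension length from an env var, falling back to a small
# default so local runs stay cheap. SPARSE_DEMO_DIM is a hypothetical name:
# the CI config would export it, local runs would leave it unset.
DIM = int(os.environ.get("SPARSE_DEMO_DIM", "100"))
shape = (DIM, 5)
print(shape)
```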
Hmm. Seems like the notebook is flaky on CI. It failed in only one of the last two commits, with no changes.
It looks like a very solid failure, though: CSR is not meant to be (1, 0) ordered.
> it looks like a very solid failure though, that csr is not meant to be (1, 0) ordered.

Well, `asarray` should convert to the right format if necessary.
Ok, so we have this code snippet from the notebook that uses the Finch backend:

```python
X = sparse.random((100, 5), density=0.08)  # creates a random COO matrix
X = sparse.asarray(X, format="csc")        # converts to CSC format
X_X = sparse.permute_dims(X, (1, 0)) @ X   # for me locally it densifies; the result is SwizzleArray(Tensor(Dense{Int64}(Dense{Int64}(Element{0.0, Float64, Int64}...
X_X.get_order()                            # gives (1, 0) order, so I can only convert to CSR
X_X = sparse.asarray(X_X, format="csr")    # move back from dense to CSR format
```

So it looks like in the CI the result order of `permute_dims(X, (1, 0)) @ X` is sometimes `(0, 1)`, as that's the only explanation for this failure.

Is this a finch version issue?
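For context, the order tuple describes how logical axes map to storage: `(0, 1)` is row-major (CSR-friendly) and `(1, 0)` is column-major (CSC-friendly). A pure-Python illustration, independent of Finch:

```python
def storage_index(coords, shape, order):
    """Linear storage offset of a logical coordinate, given an axis order."""
    offset = 0
    for axis in order:  # walk axes from slowest-varying to fastest-varying
        offset = offset * shape[axis] + coords[axis]
    return offset

shape = (3, 4)
# Row-major (0, 1): rows are contiguous. Column-major (1, 0): columns are.
print(storage_index((1, 2), shape, (0, 1)))  # 1 * 4 + 2 = 6
print(storage_index((1, 2), shape, (1, 0)))  # 2 * 3 + 1 = 7
```

The same logical array therefore lands in two different memory layouts, which is why a `(1, 0)`-ordered result can only be handed off as CSR without a copy.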
> Ok, so we have this code snippet from the notebook that uses the Finch backend: [...] So it looks like in the CI the result order of `permute_dims(X, (1, 0)) @ X` is sometimes `(0, 1)`, as that's the only explanation for this failure.
While the nondeterminism here is definitely an issue, another issue is that `asarray` should convert the format if necessary, similar to the `order=` or `dtype=` kwargs.
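The contract being asked for, sketched as a tiny stand-in helper (hypothetical names, not the sparse API): conversion is a no-op when the array already matches, and transparent otherwise, just as NumPy's `dtype=`/`order=` arguments behave.

```python
def as_format(arr, fmt):
    """Return arr unchanged if it is already in fmt, else return a converted copy.

    A toy stand-in for the requested asarray(format=...) behavior; `arr` is
    modeled as a plain dict here for illustration only.
    """
    if arr["format"] == fmt:
        return arr  # already matching: no copy, no work
    converted = dict(arr)
    converted["format"] = fmt
    return converted

a = {"format": "csc", "data": [1, 2, 3]}
assert as_format(a, "csc") is a               # no copy when formats match
assert as_format(a, "csr")["format"] == "csr" # converted otherwise
```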
> Is this a finch version issue?
We use the latest `finch-tensor` version, but I'm unsure if that's behind, as it pins `Finch.jl`.
@willow-ahrens Could it be that the output order depends on the data? `sparse.random` would produce different data/coordinates each time.
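If the data itself is the suspect, pinning the RNG makes runs comparable across machines. A stdlib sketch of reproducible random sparse coordinates (pydata/sparse's `sparse.random` takes a seed via a `random_state` argument for the same purpose, if I recall its API correctly):

```python
import random

def random_coords(shape, nnz, seed):
    """Reproducibly sample nnz distinct flat positions in an array of `shape`."""
    rng = random.Random(seed)  # seeded generator: same seed, same draws
    size = 1
    for dim in shape:
        size *= dim
    return sorted(rng.sample(range(size), nnz))

a = random_coords((100, 5), 40, seed=42)
b = random_coords((100, 5), 40, seed=42)
assert a == b  # identical coordinates on every run: no data-dependent flakiness
```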
It is possible to set the finch version we use with an environment variable. I am wondering whether Mateusz has an env var set locally?
Also, no, the output order and format should be reproducible. As far as I can tell, it always fails on CI so far?
Could the difference between these commits cause the problem?
> Could the difference between these commits cause the problem?
Nope, it was just re-enabling a lint and doesn't affect running Python code.
I can't reproduce the flakiness locally (it always fails for me). I also get a sparse-in-sparse format, so I really do think something may also be up with our finch versions.
Hmm... now it passed with `csr`... I couldn't reproduce it locally on macOS or a remote Linux machine. But I think once we merge https://github.com/willow-ahrens/finch-tensor/pull/75 we don't need to worry about it anymore, as any format will be convertible to `csc`/`csr` etc.
I'll work towards reproducing this nondeterminism. Maybe there's a similar kernel that behaves differently. Could we start by running in verbose mode?
Would y'all be okay if I push a few test commits?
> Would y'all be okay if I push a few test commits?
Sure!
> Would y'all be okay if I push a few test commits?
I've just sent a collaborator invitation, if you accept it you should be able to push to Mateusz's branch.
I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?
> I think stdout is being suppressed. Also, it's failing now saying tensordot is not defined. How are notebooks being tested, and can we get them to print stdout?
The issue was that you defined `X` as a lazy variable, used in `sparse.compute(...)`, but then also used it in:

```python
b_hat = (inverted @ sparse.permute_dims(X, (1, 0))) @ y
```

without calling `lazy` on `inverted` or `y`, or calling compute. I updated it by defining `X_lazy` separately. Now you can execute the cell to get the execution plan.
The `tensordot` error comes from the fact that the args here are a mix of lazy and eager tensors.
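A toy model of why mixing the two fails: a lazy node only knows how to combine with other lazy nodes. The class and error message below are illustrative, not the sparse API:

```python
class Lazy:
    """Toy lazy expression node that records, rather than performs, an op."""

    def __init__(self, expr):
        self.expr = expr

    def __matmul__(self, other):
        if not isinstance(other, Lazy):
            # An eager operand has no place in the deferred plan.
            raise TypeError("cannot mix lazy and eager operands")
        return Lazy(f"({self.expr} @ {other.expr})")

X = Lazy("X")
y = [1.0, 2.0]  # an eager value

plan = X @ Lazy("y")
print(plan.expr)  # (X @ y)

try:
    X @ y  # lazy @ eager: rejected instead of silently computing
except TypeError as e:
    print(e)
```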
In the CI, notebooks are tested by executing the whole notebook: if the notebook passes, it's a ✅ without output; if there's an error, it reports the stack trace and points to the cell that failed.
@hameerabbasi I'm not sure about using `nbmake`. Right now it fails with an internal `nbmake` error (https://github.com/pydata/sparse/actions/runs/9661787940/job/26650246800?pr=709) after upgrading `finch-tensor` with the latest patch (a version that passes everywhere else). I tried to debug it but didn't find anything informative. Is there an `nbmake` alternative? Maybe we could merge this notebook without running it in the CI.
> @hameerabbasi I'm not sure about using `nbmake`. [...] Is there an `nbmake` alternative? Maybe we could merge this notebook without running it in the CI.
Yes let's do that for now. I'll look for alternatives.
👍 Let me squash git history and hide that CI job for now.
@hameerabbasi Ok, I figured out what the issue was: an out-of-memory error. I commented out the last test configurations in MTTKRP (1000x1000x1000). Let me clean the notebook up and squash.
I think we can stay with `nbmake`, but it could have reported the OOM more explicitly.
@willow-ahrens You can force pull the branch (I squashed commits) and the notebook should run and print the execution plan in `sparse.compute(..., verbose=True)`.

I released a new `finch-tensor`, so you also need `pip install --upgrade finch-tensor` in your env to convert freely between formats.
Wait, I'm not sure why the OOM on MTTKRP is related to the nondeterministic transposition order of `X`.
> wait, I'm not sure why the oom on mttkrp is related to the nondeterministic transposition order of X
Would you prefer that be resolved in this PR? AFAICT, the code here has no issues related to that. We can, of course, file an issue to track it if you prefer.
I'm happy to resolve in another PR! I'd like to understand though whether we have an explanation for that behavior, and whether it is still observed here. Is it explained by the OOM, for example? Was it just the finch-tensor version all along? Perhaps I can try to reproduce the problem with a simple test in finch-tensor.
I was confused because it didn't seem like we found the original source of nondeterminism, and I wanted to check whether y'all felt like you had found it.
> wait, I'm not sure why the oom on mttkrp is related to the nondeterministic transposition order of X
@willow-ahrens They're unrelated. There's an issue where `X.T @ X` returned a different resulting order locally on my machine than in the CI. This caused `sparse.asarray(arr, format="csr")` to fail in the CI, as `arr` had a different order than CSR required. I implemented a fix in `finch-tensor` that refines order handling and passes SwizzleArrays to `copyto!` when we change storage; therefore any format can now be changed to any other format, as `copyto!` accepts SwizzleArrays as source and destination.
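Roughly, the fix amounts to letting the copy routine see the axis permutation, so it can walk the source in its logical order while filling the destination layout. A pure-Python sketch of that idea (not the Finch implementation):

```python
def copy_through_permutation(src_rows, perm):
    """Copy a 2-D nested list through an axis permutation.

    perm == (0, 1) copies as-is; perm == (1, 0) copies the transpose,
    which is what converting between row- and column-ordered storage needs.
    """
    if perm == (0, 1):
        return [row[:] for row in src_rows]      # same layout: plain copy
    return [list(col) for col in zip(*src_rows)]  # permuted: walk columns

A = [[1, 0], [0, 2], [3, 0]]  # a 3x2 source
print(copy_through_permutation(A, (1, 0)))  # [[1, 0, 3], [0, 2, 0]]
```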
But there's an issue where `copyto!(dense_source, any_target)` copies zeros too, as non-zero values, so for dense source formats I needed to add `dropfills` for this case, which creates another copy.
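The workaround, sketched: after a dense-source copy, drop entries equal to the fill value so they don't become stored "non-zeros". This is a toy stand-in for Finch's `dropfills`, not its actual code:

```python
def drop_fills(coords_vals, fill=0.0):
    """Remove entries equal to the fill value from a COO-style (coord, value) list."""
    return [(c, v) for c, v in coords_vals if v != fill]

# A dense-to-sparse copy would naively materialize every position, zeros included:
entries = [((0, 0), 1.0), ((0, 1), 0.0), ((2, 1), 3.0)]
print(drop_fills(entries))  # [((0, 0), 1.0), ((2, 1), 3.0)]
```

Note this pass walks (and here rebuilds) the data once more, which is the extra copy being discussed.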
This fixed calling `sparse.asarray(arr, format="csr")` for any format of `arr`, but in MTTKRP we call `sparse.asarray(dense, format="csf")`, which for the largest dimension configuration caused an OOM, as the dense tensor was copied twice (in `copyto!` and `dropfills`). I disabled running all configurations in the CI (we only want to make sure the notebook generally runs) and now the job passes.
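For scale, a back-of-the-envelope check of why two copies of the largest configuration blow past a typical CI runner's memory (assuming float64 elements):

```python
# A dense 1000 x 1000 x 1000 float64 tensor:
bytes_per_float64 = 8
dense_bytes = 1000 ** 3 * bytes_per_float64  # 8_000_000_000 bytes

gib = dense_bytes / 2 ** 30
print(round(gib, 2))       # ~7.45 GiB for one copy
print(round(2 * gib, 2))   # ~14.9 GiB with the extra dropfills copy
```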
Feel free to merge, though; you're right that this can go through.
@mtsokol Thanks! Yes, I think `dropfills` should probably support transposition like `copyto!` does. I'll add a PR.
Thank you! My two comments in https://github.com/willow-ahrens/Finch.jl/issues/609 describe these two issues.
Also, it's good to know that the nondeterminism was related to the Finch version. I'll add a more direct test for this to `finch-tensor` so we can see if it ever fails again nondeterministically in CI.
Merging, thanks for the follow-ups @willow-ahrens and @mtsokol!
Hi @willow-ahrens @hameerabbasi,

Here I add a short example notebook with benchmarks and plots for the latest released alpha version.