ad.experimental.concat_on_disk fills some obs with zeros when concatenating

Please make sure these conditions are met

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of anndata.
[ ] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Code:

import anndata as ad
ad.experimental.concat_on_disk(
    in_files=files_to_concat, out_file=output_dir + 'In_Vivo_NHP_and_Human.h5ad',
    max_loaded_elems=10000000, axis=0, join='inner', merge='first', fill_value=0,
)

With this version of anndata (0.10.5.post1, installed with -c conda-forge), some obs lose their values. The zeros start exactly at row 694000, and more than 1k cells have zero counts. When I plot np.sum(adata[693000:695000,:].X, axis=1), rows 694000 and above are zero (counts come back at some unknown point roughly 2-3k rows later, I believe). This happens within a single sample, not at a boundary between merged samples. The sample was QC'd beforehand, with all obs filtered to >1000 counts. I repeated the on-disk concat with this version of anndata and got the same problem.
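To pin down where the zero block starts and where counts come back, the row sums can be scanned for contiguous runs of zeros. This is only a minimal sketch: `zero_row_runs` is a hypothetical helper (not part of anndata), and the toy array stands in for the result of np.sum(adata[a:b,:].X, axis=1).

```python
import numpy as np

def zero_row_runs(row_sums):
    """Return (start, stop) index pairs for each contiguous run of zero row sums.

    `stop` is exclusive, so row_sums[start:stop] is all zeros.
    """
    # np.asarray(...).ravel() also flattens the (n, 1) np.matrix that sparse sums return.
    is_zero = np.asarray(row_sums).ravel() == 0
    # Pad with False on both sides so runs touching either end are still detected.
    padded = np.concatenate(([False], is_zero, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    # Changes alternate: run start, run end, run start, ...
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

# Toy data standing in for the per-cell totals of a window of obs.
toy = np.array([5, 3, 0, 0, 0, 7, 0, 2])
runs = zero_row_runs(toy)
print(runs)
```

When the sums come from a window such as adata[693000:695000,:], add the window start (693000) to each pair to get positions in the full object.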

This problem was mentioned here before but, as far as I can tell, never followed up on: https://discourse.scverse.org/t/counts-in-layers-is-zero-after-ad-concat/1999

I then updated to version 0.10.8, and the problem was fixed! Here's how I confirmed the fix:

import scanpy as sc
import rapids_singlecell as rsc  # assumed: `rsc` refers to rapids_singlecell

rsc.pp.calculate_qc_metrics(adata, expr_type='counts', var_type='genes', qc_vars=None, log1p=True, layer=None)
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts"], jitter=0.4,
             multi_panel=True, save='violin_post_merge.png')

[Figure: violinviolin_post_merge (violin plots of n_genes_by_counts and total_counts after the merge)]

I believe that one outlier comes from converting to one-to-one orthologues, which must have dropped a lot of genes that are not evolutionarily conserved. Actually, I'm not sure now, because it has zero counts. I'll look into it.

This post has been through many edits; I should have waited to post. The check below shows no spots with counts below 1,000, yet the problem remains and is breaking my downstream code. I will continue to debug.


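import numpy as np

# Scan the matrix in chunks of 100,000 obs and record the smallest per-cell total in each chunk.
lowest_values = []
for n in range(0, 1046120, 100000):
    chunk = adata[n:n+100000, :].X
    # On a sparse matrix, .sum(axis=1) returns an (n, 1) np.matrix; flatten it to 1-D.
    row_sums = np.asarray(chunk.sum(axis=1)).ravel()
    lowest_values.append(row_sums.min())

print(lowest_values)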
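The loop above keeps only the minimum value per chunk, not where it occurs. A variant that also reports the global row index could look like the sketch below. `chunk_minima` is a hypothetical helper, and the small NumPy array is toy data; the same function should work on any 2-D matrix that supports row slicing and .sum(axis=1), such as a SciPy CSR matrix.

```python
import numpy as np

def chunk_minima(X, chunk_size=100_000):
    """Yield (global_row_index, row_sum) for the smallest row sum in each chunk."""
    n_rows = X.shape[0]
    for start in range(0, n_rows, chunk_size):
        block = X[start:start + chunk_size]
        # Flatten so both dense arrays and sparse-matrix sums end up 1-D.
        sums = np.asarray(block.sum(axis=1)).ravel()
        local = int(sums.argmin())
        # Convert the chunk-local position back to a row index in X.
        yield start + local, float(sums[local])

# Toy matrix: every row has positive counts except row 7, which is zeroed out.
rng = np.random.default_rng(0)
X = rng.integers(1, 10, size=(10, 4))
X[7] = 0
result = list(chunk_minima(X, chunk_size=4))
print(result)
```

With the index in hand, the suspicious cell can be inspected directly, for example by looking at its obs row.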

I can't find that cell. It might be some other problem in my code, so I think this is good for now. Closing it.

Versions

-----
anndata             0.10.8
numpy               1.26.3
pandas              2.2.0
scanpy              1.9.8
session_info        1.0.0
-----
PIL                         10.2.0
anyio                       NA
array_api_compat            1.4.1
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
babel                       2.11.0
brotli                      1.1.0
certifi                     2024.07.04
cffi                        1.16.0
charset_normalizer          3.3.2
colorama                    0.4.6
comm                        0.1.2
cycler                      0.12.1
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.6.7
decorator                   5.1.1
defusedxml                  0.7.1
executing                   0.8.3
fastjsonschema              NA
google                      NA
h5py                        3.10.0
idna                        3.6
igraph                      0.11.4
ipykernel                   6.28.0
jedi                        0.18.1
jinja2                      3.1.3
joblib                      1.3.2
json5                       NA
jsonschema                  4.19.2
jsonschema_specifications   NA
jupyter_events              0.8.0
jupyter_server              2.10.0
jupyterlab_server           2.25.1
kiwisolver                  1.4.5
leidenalg                   0.10.2
llvmlite                    0.41.1
markupsafe                  2.1.4
matplotlib                  3.8.2
mpl_toolkits                NA
natsort                     8.4.0
nbformat                    5.9.2
numba                       0.58.1
overrides                   NA
packaging                   23.2
parso                       0.8.3
pexpect                     4.9.0
pkg_resources               NA
platformdirs                3.11.0
prometheus_client           NA
prompt_toolkit              3.0.43
psutil                      5.9.8
ptyprocess                  0.7.0
pure_eval                   0.2.2
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pyparsing                   3.1.1
pythonjsonlogger            NA
pytz                        2023.4
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.12.0
send2trash                  NA
six                         1.16.0
sklearn                     1.4.0
sniffio                     1.3.0
socks                       1.7.1
stack_data                  0.2.0
texttable                   1.7.0
threadpoolctl               3.2.0
torch                       2.3.1.post100
torchgen                    NA
tornado                     6.3.3
tqdm                        4.66.1
traitlets                   5.14.1
typing_extensions           NA
urllib3                     2.2.0
wcwidth                     0.2.13
websocket                   1.7.0
yaml                        6.0.1
zmq                         25.1.2
zoneinfo                    NA
-----
IPython             8.20.0
jupyter_client      8.6.0
jupyter_core        5.5.0
jupyterlab          4.0.8
-----
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0]
Linux-6.1.0-23-cloud-amd64-x86_64-with-glibc2.36
-----
Session information updated at 2024-08-19 02:43
