theislab / scib

Benchmarking analysis of data integration tools
MIT License

Error running LISI when reading index file #374

Closed HelloWorldLTY closed 1 year ago

HelloWorldLTY commented 1 year ago

Hi, when I try to run iLISI, I get this error: ParserError: Error tokenizing data. C error: Expected 70 fields in line 6, saw 91

Here are the details:

```
ParserError                               Traceback (most recent call last)
Input In [23], in ()
----> 1 calculate_metric(adata, cor_list)

Input In [9], in calculate_metric(adata, cor_list)
      2 asw = calculate_common_asw(adata)
      3 AUC = calculate_AUC(adata, cor_list)
----> 4 ilisi = calculate_iLISI(adata)
      5 gc = calculate_graph_connectivity(adata)
      7 percp = calculate_common_gene_propertion(adata)

Input In [6], in calculate_iLISI(adata)
      7 adata_new = adata[[True if i in common_gene_list else False for i in adata.obs['gene']]]
      8 adata_new.obsm['X_emb'] = adata_new.X
---> 10 result = scib.metrics.ilisi_graph(adata_new, batch_key="tissue", type_="embed")
     12 return result

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/scib/metrics/lisi.py:77, in ilisi_graph(adata, batch_key, k0, type_, subsample, scale, n_cores, verbose)
     74 check_batch(batch_key, adata.obs)
     76 adata_tmp = recompute_knn(adata, type_)
---> 77 ilisi_score = lisi_graph_py(
     78     adata=adata_tmp,
     79     batch_key=batch_key,
     80     n_neighbors=k0,
     81     perplexity=None,
     82     subsample=subsample,
     83     n_cores=n_cores,
     84     verbose=verbose,
     85 )
     87 # iLISI: nbatches good, 1 bad
     88 ilisi = np.nanmedian(ilisi_score)

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/scib/metrics/lisi.py:285, in lisi_graph_py(adata, batch_key, n_neighbors, perplexity, subsample, n_cores, verbose)
    282     simpson_estimate_batch = np.concatenate(results)
    284 else:
--> 285     simpson_estimate_batch = compute_simpson_index_graph(
    286         file_prefix=prefix,
    287         batch_labels=batch,
    288         n_batches=n_batches,
    289         perplexity=perplexity,
    290         n_neighbors=n_neighbors,
    291     )
    293 tmpdir.cleanup()
    295 return 1 / simpson_estimate_batch

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/scib/metrics/lisi.py:405, in compute_simpson_index_graph(file_prefix, batch_labels, n_batches, n_neighbors, perplexity, chunk_no, tol)
    402     return lists
    404 # read distances and indices with nan value handling
--> 405 indices = pd.read_table(index_file, index_col=0, header=None, sep=",")
    406 indices = indices.T
    408 distances = pd.read_table(distance_file, index_col=0, header=None, sep=",")

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/io/parsers/readers.py:683, in read_table(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, encoding_errors, delim_whitespace, low_memory, memory_map, float_precision)
    668 kwds_defaults = _refine_defaults_read(
    669     dialect,
    670     delimiter,
   (...)
    679     defaults={"delimiter": "\t"},
    680 )
    681 kwds.update(kwds_defaults)
--> 683 return _read(filepath_or_buffer, kwds)

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/io/parsers/readers.py:488, in _read(filepath_or_buffer, kwds)
    485     return parser
    487 with parser:
--> 488     return parser.read(nrows)

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/io/parsers/readers.py:1047, in TextFileReader.read(self, nrows)
   1045 def read(self, nrows=None):
   1046     nrows = validate_integer("nrows", nrows)
-> 1047     index, columns, col_dict = self._engine.read(nrows)
   1049     if index is None:
   1050         if col_dict:
   1051             # Any column is actually fine:

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py:224, in CParserWrapper.read(self, nrows)
    222 try:
    223     if self.low_memory:
--> 224         chunks = self._reader.read_low_memory(nrows)
    225         # destructive to chunks
    226         data = _concatenate_chunks(chunks)

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:801, in pandas._libs.parsers.TextReader.read_low_memory()

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:857, in pandas._libs.parsers.TextReader._read_rows()

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:843, in pandas._libs.parsers.TextReader._tokenize_rows()

File /gpfs/ysm/project/zhao/tl688/conda_envs/scglue/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:1925, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 70 fields in line 6, saw 91
```

Could you please help me? Thanks a lot. I'm not sure whether it's caused by the pandas version or by NaN values in the result.

mumichae commented 1 year ago

Hi, which version of scib are you working with? Does this happen with older versions as well?

mumichae commented 1 year ago

@lazappi @HelloWorldLTY What versions of pandas are you using? Did you update the environment when things started to break or was this in an old environment?

lazappi commented 1 year ago

I'm using pandas v1.4.3 and scib v1.1.1 but I don't think the issue is related to versions.

The dataset I am using is tiny (~200 cells). What I think is happening is that, because there are so few cells, the function isn't able to find k0 neighbours for each cell. What actually causes the error is that the number of neighbours found differs between cells (maybe because there are two components?), so the rows in the index file have different lengths (you can see that in the file in #376), and pandas then fails to read the file later on. I have semi-confirmed this by running my code on other datasets without any issues and by successfully running LISI on the tiny dataset with k0 set to a lower value.
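
For illustration, here's a minimal sketch of the failure outside of scib (the file contents below are made up, not the actual kNN index file): the C parser infers the number of columns from the early rows and then errors on a later, longer row.

```
import pandas as pd
from io import StringIO

# Made-up index file: the first rows have 3 neighbours each,
# the last row has 5, so the rows have different numbers of fields.
ragged = StringIO(
    "0,10,11,12\n"
    "1,13,14,15\n"
    "2,16,17,18,19,20\n"
)

# The C parser expects 4 fields (inferred from the first rows) and
# raises a ParserError when it reaches the longer row.
try:
    pd.read_table(ragged, index_col=0, header=None, sep=",")
except pd.errors.ParserError as err:
    print(err)  # Expected 4 fields in line 3, saw 6
```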

For my use case I can just adjust k0 based on the number of cells but it would be better to have a proper solution. Things I can think of:

There are probably other options as well.
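
For reference, the interim adjustment I mentioned (capping k0 by the number of cells) could look roughly like the sketch below; the cap of `adata.n_obs - 1` is my own heuristic, not a scib default.

```
import scib

# Never ask for more neighbours than there are other cells;
# 90 is scib's default k0, the cap itself is a rough guess.
k0 = min(90, adata.n_obs - 1)

score = scib.metrics.ilisi_graph(adata, batch_key="Batch", type_="embed", k0=k0)
```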

HelloWorldLTY commented 1 year ago

Hi, I tried decreasing k0 and I did not get this error again. Is there any way to have k0 set adaptively by default? Thanks a lot.

lazappi commented 1 year ago

This came up again so I looked into it a bit more. It's common for fewer than k0 neighbours to be found for some cells (if the graph isn't fully connected), and the following code handles this. The error happens because pandas sometimes infers the wrong number of columns for the file (it picks the lower number) and then fails to read the rows that do have the full number of neighbours.

I think the easiest long-term solution (without changing the C++ code) is just to make sure that pandas knows how many columns the file should have by setting the names argument, something like:

pd.read_table(index_file, index_col=0, header=None, sep=",", names=["index"] + list(range(1, n_neighbors + 1)))

This just produces NaN values which other functions know how to handle anyway.
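
As a rough illustration of that behaviour (a made-up two-row file, not the scib internals), fixing the column names up front makes pandas pad the short rows with NaN instead of raising:

```
import pandas as pd
from io import StringIO

n_neighbors = 5

# First row found only 3 neighbours, second row found all 5
ragged = StringIO(
    "0,10,11,12\n"
    "1,13,14,15,16,17\n"
)

indices = pd.read_table(
    ragged,
    index_col=0,
    header=None,
    sep=",",
    names=["index"] + list(range(1, n_neighbors + 1)),
)
# The first row is padded with NaN in columns 4 and 5
print(indices)
```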

My temporary workaround is to remove cells belonging to components with fewer than k0 cells from the object before passing it to the metric function. I'm not a big fan of this because it's a lot of extra processing and it changes the metric value slightly, but it's the best I could come up with without modifying the package.

from scib.metrics import ilisi_graph
from scanpy.preprocessing import neighbors
from igraph import Graph

# Remove cells in components with fewer than k0 cells to avoid theislab/scib/issues/374
k0 = 15  # must match the number of neighbours passed to ilisi_graph() below

# Calculate the neighbourhood graph using the same settings as the metric
neighbors(adata, n_neighbors=k0, use_rep="X_emb")

# Create a graph object from the neighbourhood graph
graph = Graph.Weighted_Adjacency(adata.obsp["connectivities"])

# Get the connected components
components = graph.connected_components()
component_sizes = [len(component) for component in components]

# Check which components have fewer than k0 cells
connected = []
n_unconnected = 0
for idx in range(len(components)):
    if component_sizes[idx] >= k0:
        connected += components[idx]
    else:
        n_unconnected += component_sizes[idx]

if n_unconnected > 0:
    from warnings import warn
    warn(f"Found {n_unconnected} cells with fewer than {k0} neighbours. These cells will be skipped.")

# Delete the neighbourhood graph so it's not used by the metric
del adata.uns["neighbors"]
del adata.obsp["distances"]
del adata.obsp["connectivities"]

print("Calculating qLISI score...")
score = ilisi_graph(
    adata[connected, :],
    batch_key="Batch",
    type_="embed",
    use_rep="X_emb",
    k0=k0,
    subsample=None,
    scale=True,
    verbose=True,
)