Closed · HelloWorldLTY closed this 1 year ago
Hi, which version of scib are you working with? Does this happen with older versions as well?
@lazappi @HelloWorldLTY What versions of pandas are you using? Did you update the environment when things started to break or was this in an old environment?
I'm using pandas v1.4.3 and scib v1.1.1 but I don't think the issue is related to versions.
The dataset I am using is tiny (~200 cells). What I think is happening is that, because there are so few cells, the function isn't able to find `k0` neighbours for each cell. What actually causes the error is that the number of neighbours that can be found differs between cells (maybe because there are two components?), which means that the rows in the index file have different lengths (you can see that in the file in #376). pandas then fails to read this file later on. I have semi-confirmed this by running my code on other datasets without any issues and by successfully running LISI on the tiny dataset with `k0` set to a lower value.
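To show the failure mode in isolation, here is a minimal sketch using a made-up ragged file (not the real kNN index output): pandas' C parser infers the column count from the first row and then refuses rows with more fields, which is the same kind of `Expected N fields in line M, saw K` error reported in this issue.

```python
import io

import pandas as pd

# Hypothetical ragged index file: the first cell only found 3 neighbours,
# the second found 5, so the rows have different numbers of fields
ragged = io.StringIO("0,1,2,3\n1,4,5,6,7,8\n")

# The C parser takes the column count from the first row (4 fields) and then
# fails on the longer second row with:
# ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
pd.read_table(ragged, index_col=0, header=None, sep=",")
```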
For my use case I can just adjust `k0` based on the number of cells, but it would be better to have a proper solution. Things I can think of:

- Checking that `k0` neighbours can be found for each cell (and throwing an error if not)
- Setting `k0` to whatever the lower number found is (with a warning; a rough sketch of this is below)

There are probably other options as well.
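For reference, here is roughly what the second option could look like from the caller side. This isn't anything scib provides; it's a sketch that assumes an AnnData object `adata` with an embedding in `X_emb`, and it builds a neighbour graph just to measure how many cells the smallest connected component can offer.

```python
import warnings

from scanpy.preprocessing import neighbors
from scipy.sparse.csgraph import connected_components


def adaptive_k0(adata, k0=90, use_rep="X_emb", n_neighbors=15):
    """Cap k0 at what the smallest connected component can support (sketch only)."""
    # Build the neighbourhood graph on the embedding
    neighbors(adata, n_neighbors=n_neighbors, use_rep=use_rep)
    # Label the connected components of the graph
    n_comp, labels = connected_components(adata.obsp["connectivities"], directed=False)
    smallest = min((labels == i).sum() for i in range(n_comp))
    # A cell in a component of size s can have at most s - 1 neighbours
    if smallest - 1 < k0:
        warnings.warn(
            f"Lowering k0 from {k0} to {smallest - 1} "
            f"(smallest graph component has only {smallest} cells)"
        )
        k0 = smallest - 1
    return k0
```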
Hi, I tried decreasing k0 and I did not receive this error again. Is there any method to set k0 to a default adaptive value? Thanks a lot.
This came up again so I looked into it a bit more. It's common for fewer than `k0` neighbours to be found for some cells (if the graph isn't fully connected) and the following code handles this. The error happens because sometimes pandas gets the number of columns in the file wrong (it picks the lower number instead) and then it fails to read the rows with the correct number of neighbours.
I think the easiest long-term solution (without changing the C++ code) is just to make sure that pandas knows how many columns the file should have by setting the `names` argument, something like:

```python
pd.read_table(index_file, index_col=0, header=None, sep=",", names=["index"] + list(range(1, n_neighbors + 1)))
```
This just produces `NaN` values, which other functions know how to handle anyway.
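To illustrate with the same made-up ragged file from the sketch above (and a hypothetical neighbour count of 5):

```python
import io

import pandas as pd

ragged = io.StringIO("0,1,2,3\n1,4,5,6,7,8\n")
n_neighbors = 5  # hypothetical value standing in for the metric's neighbour count

df = pd.read_table(
    ragged,
    index_col=0,
    header=None,
    sep=",",
    names=["index"] + list(range(1, n_neighbors + 1)),
)

# The short first row is now padded with NaN instead of raising a ParserError
print(df)
```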
My temporary workaround is to remove cells belonging to components with fewer than `k0` cells from the object before passing it to the metric function. I'm not a big fan of this because it's a lot of extra processing and it changes the metric value slightly, but it's the best I could come up with without modifying the package.
```python
from warnings import warn

from igraph import Graph
from scanpy.preprocessing import neighbors
from scib.metrics import ilisi_graph

# Remove cells in components with fewer than k0 cells to avoid
# https://github.com/theislab/scib/issues/374

# Calculate the neighbourhood graph using the same settings as the metric
neighbors(adata, n_neighbors=15, use_rep="X_emb")

# Create a graph object from the neighbourhood graph
graph = Graph.Weighted_Adjacency(adata.obsp["connectivities"])

# Get the connected components and their sizes
components = graph.connected_components()
component_sizes = [len(component) for component in components]

# Keep cells in components with at least k0 cells
connected = []
n_unconnected = 0
for idx in range(len(components)):
    if component_sizes[idx] >= k0:
        connected += components[idx]
    else:
        n_unconnected += component_sizes[idx]

if n_unconnected > 0:
    warn(f"Found {n_unconnected} cells with fewer than {k0} neighbours. These cells will be skipped.")

# Delete the neighbourhood graph so it's not used by the metric
del adata.uns["neighbors"]
del adata.obsp["distances"]
del adata.obsp["connectivities"]

print("Calculating iLISI score...")
score = ilisi_graph(
    adata[connected, :],
    batch_key="Batch",
    type_="embed",
    use_rep="X_emb",
    k0=k0,
    subsample=None,
    scale=True,
    verbose=True,
)
```
Hi, when I try to run iLISI, I get this error: `ParserError: Error tokenizing data. C error: Expected 70 fields in line 6, saw 91`
Here are the details:
```
ParserError                               Traceback (most recent call last)
Input In [23], in
```

Could you please help me? Thanks a lot. I do not know if it is caused by the version of pandas or by NaN values in the result.