mortazavilab / swan_vis

A Python library to visualize and analyze long-read transcriptomes
https://freese.gitbook.io/swan/
MIT License

Index contains duplicate entries, cannot reshape #27

Closed rugilemat closed 7 months ago

rugilemat commented 8 months ago

Hi,

Thanks for a gorgeous tool!

I've been trying Swan out on my samples, but I keep running into this error:

Adding annotation to the SwanGraph

Adding transcriptome to the SwanGraph
/users/k19022845/.local/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

Adding abundance for datasets NPCBC01, NPCBC02, NPCBC03, NPCBC04, NPCBC05... (and 31 more) to SwanGraph
Calculating TPM...
Calculating PI...
Traceback (most recent call last):
  File "swan_trial.py", line 17, in <module>
    sg.add_abundance(ab_file)
  File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 571, in add_abundance
    self.merge_adata_abundance(adata, how=how)
  File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/swangraph.py", line 424, in merge_adata_abundance
    sg_adata.layers['pi'] = sparse.csr_matrix(calc_pi(sg_adata, self.t_df)[0].to_numpy())
  File "/users/k19022845/.local/lib/python3.8/site-packages/swan_vis/utils.py", line 427, in calc_pi
    df = df.pivot(columns=obs_col, index=id_col, values='pi')
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 8567, in pivot
    return pivot(self, index=index, columns=columns, values=values)
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/pivot.py", line 540, in pivot
    return indexed.unstack(columns_listlike)  # type: ignore[arg-type]
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/series.py", line 4455, in unstack
    return unstack(self, level, fill_value)
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 489, in unstack
    unstacker = _Unstacker(
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 137, in __init__
    self._make_selectors()
  File "/users/k19022845/.local/lib/python3.8/site-packages/pandas/core/reshape/reshape.py", line 189, in _make_selectors
    raise ValueError("Index contains duplicate entries, cannot reshape")
ValueError: Index contains duplicate entries, cannot reshape

I'm not entirely sure where the duplicate entry issue is coming from, so any advice on that would be great!
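(For context, the traceback points to a short driver script, swan_trial.py. A minimal sketch of what that script presumably looks like is below; the file paths and the add_annotation/add_transcriptome calls are assumptions inferred from the log messages above, and only the sg.add_abundance(ab_file) call is confirmed by the traceback.)

import swan_vis as swan

# Placeholder paths; only the add_abundance() call is visible in the traceback above.
annot_gtf = 'annotation.gtf'
data_gtf = 'transcriptome.gtf'
ab_file = 'abundance.tsv'

sg = swan.SwanGraph()
sg.add_annotation(annot_gtf)    # "Adding annotation to the SwanGraph"
sg.add_transcriptome(data_gtf)  # "Adding transcriptome to the SwanGraph"
sg.add_abundance(ab_file)       # raises ValueError: Index contains duplicate entries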

fairliereese commented 8 months ago

Hi, thanks for your kind words :)

I seem to recall debugging a similar problem for myself semi-recently. Can you tell me if you're using the latest commits from GitHub? If not, would you mind trying that?
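(One quick way to check which release is installed is shown below; this is a minimal sketch, and it assumes the distribution is registered under the name swan_vis, which a local or editable checkout may not be.)

from importlib.metadata import version, PackageNotFoundError

# Print the installed swan_vis version, if package metadata is available.
try:
    print(version('swan_vis'))
except PackageNotFoundError:
    print('no swan_vis metadata found; possibly installed from a local checkout')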

rugilemat commented 8 months ago

Yes, it should be the latest commit.

fairliereese commented 8 months ago

OK, based on the warning you're getting from AnnData, it would appear that you have some duplicated transcript IDs, likely in your abundance matrix, judging by where the error is being thrown. To test this, please run one of the following code blocks in Python, depending on what format your data is in:

If you're using a TALON abundance file:

import pandas as pd
# Show any rows whose annot_transcript_id appears more than once
df = pd.read_csv('<your abundance file>', sep='\t')
print(df.loc[df.annot_transcript_id.duplicated(keep=False)].sort_values(by='annot_transcript_id'))

If you're using the non-specific formatted abundance file:

import pandas as pd
# Show any rows whose value in the first column (the transcript ID) is duplicated
df = pd.read_csv('<your abundance file>', sep='\t')
print(df.loc[df[df.columns[0]].duplicated(keep=False)].sort_values(by=[df.columns[0]]))

If this prints any data, you have duplicated transcript IDs in your dataset which you must address. Let me know if this helps you solve the problem, or if this code doesn't run for you (I did not test it).
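(If duplicates do turn up, one possible way to clean the matrix before re-running add_abundance is to collapse them by summing their counts. The sketch below assumes the non-specific format, with transcript IDs in the first column and numeric counts in the remaining columns; whether summing, dropping, or renaming duplicates is the right fix depends on why they appear, and the output filename is just a placeholder.)

import pandas as pd

df = pd.read_csv('<your abundance file>', sep='\t')
id_col = df.columns[0]

# Collapse rows that share a transcript ID by summing their numeric counts,
# then write out a deduplicated file to pass to add_abundance instead.
dedup = df.groupby(id_col, as_index=False).sum(numeric_only=True)
dedup.to_csv('abundance_dedup.tsv', sep='\t', index=False)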

rugilemat commented 7 months ago

Thanks for this, and sorry for the delay - they have been updating our HPC and it's been a pain to get any jobs to run. It seems this sorted the issue out - thanks!