linnarsson-lab / FISHscale

Spatial analysis of FISH data

Error in tutorial notebook #26

Open dpshepherd opened 1 year ago

dpshepherd commented 1 year ago

Hi all,

Thank you for this amazing tool and resource!

I had to make a few modifications to get it installed on Windows 11 with Python 3.9. First, I created and activated a new environment:

conda create -n fishscale python=3.9 
conda activate fishscale

Then I installed geopandas and jax manually

conda install geopandas
pip install "jax[cpu]===0.3.14" -f https://whls.blob.core.windows.net/unstable/index.html --use-deprecated legacy-resolver

Then I was able to install FISHscale from inside the repo folder:

pip install -e .

However, when I load up the multi-dataset tutorial, I get the following error on the Load Data block

FileNotFoundError                         Traceback (most recent call last)
Cell In[5], line 4
      1 #Collect files
      2 color_dict = pickle.load(open(data_path + 'Mouse_atlas_168_color_dict.pkl', 'rb'))
----> 4 md = dataset.MultiDataset(data_path, 
      5                           x_label='c_px_microscope_stitched', 
      6                           y_label= 'r_px_microscope_stitched',
      7                           gene_label = 'decoded_genes', 
      8                           pixel_size='0.18 micrometer',
      9                           select_valid=True,
     10                           color_input=color_dict, 
     11                           reparse=True, 
     12                           unique_genes=None, 
     13                           verbose=True, 
     14                           exclude_genes=['Control1', 'Control2', 'Control3', 'Control4', 'Control5','Control6', 'Control7', 'Control8'],
     15                           z=[-140, 600, 1200, 1810, 2420, 3000, 3600])
     17 # Have olfactory bulb pointing left.
     18 for mdd in md.datasets:

File c:\users\dpshe\documents\github\fishscale\FISHscale\utils\dataset.py:673, in MultiDataset.__init__(self, data, data_folder, unique_genes, MultiDataset_name, color_input, verbose, grid_layout, columns_layout, x_label, y_label, gene_label, other_columns, exclude_genes, z, pixel_size, x_offset, y_offset, z_offset, polygon, select_valid, reparse, parse_num_threads)
    671     if parse_num_threads == -1 or parse_num_threads > self.cpu_count:
    672         parse_num_threads = self.cpu_count
--> 673     self.load_from_files(data, x_label, y_label, gene_label, other_columns, unique_genes, exclude_genes, z, 
    674                          pixel_size, x_offset, y_offset, z_offset, polygon, select_valid, reparse, color_input, 
...
     88 file_name = path.join(self.FISHscale_data_folder, f'{self.dataset_name}_metadata.pkl')
---> 89 with open(file_name, 'wb') as pf:
     90     pickle.dump(data_dict, pf)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\dpshe\\Downloads\\20324814\\211114_08_40_50_LBEXP20200925_EEL_Mouse_-140um_data_summary_simple_plotting_cleaned_microscope_stitched_FISHscale_Data\\211114_08_40_50_LBEXP20200925_EEL_Mouse_-140um_data_summary_simple_plotting_cleaned_microscope_stitched_metadata.pkl'

The directory is created, but no .pkl file is added. Any suggestions on how to get it running?

Thanks! Doug

larsborm commented 1 year ago

Hi Doug, Thank you for the interest in FISHscale!

It is strange that it cannot create the pickle files. I have now changed the code to remove the with open... part, and I hope this fixes it. If not, could you check whether you can create pickle files in the folder you are working in, with:

import pickle
a = ['hello']
pickle.dump(a, open('test.pkl', 'wb'))

And open them again:

print(pickle.load(open('test.pkl', 'rb')))
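For a test that is easier to repeat, the same round trip can be wrapped in a small helper that closes the file handles and cleans up after itself. This is only a sketch: the function name and the test file name are made up for illustration and are not part of FISHscale.

```python
import pickle
import tempfile
from pathlib import Path


def pickle_round_trip(folder):
    """Write a small pickle into `folder`, read it back, then delete it.

    Returns True if the object read back equals the object written.
    """
    # Hypothetical file name, chosen only for this test.
    test_file = Path(folder) / "fishscale_write_test.pkl"
    payload = ["hello"]

    # Context managers close the handle before the file is read back.
    with open(test_file, "wb") as fh:
        pickle.dump(payload, fh)
    with open(test_file, "rb") as fh:
        restored = pickle.load(fh)

    test_file.unlink()  # remove the test file again
    return restored == payload


if __name__ == "__main__":
    # Replace the temporary directory with the folder that holds your data.
    with tempfile.TemporaryDirectory() as tmp:
        print(pickle_round_trip(tmp))
```

If the helper raises FileNotFoundError or PermissionError for the data folder, the problem lies with the folder rather than with FISHscale.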

I'm also setting up a Windows box to test myself.

Just a side note, most FISHscale functions are intended for 2D stuff. But after reading that you are interested in using it, I have now added the functionality to load Z information for each RNA molecule. You can include the Z coordinates in your .parquet datafile, and then load it by providing the correct 'Z_label' column name. It will be ignored for most functions but the Open3D viewer will display it (use the .visualize() function).

Please let me know if you run into more issues!

dpshepherd commented 1 year ago

Thanks!! It is awesome that you added the 3D capability. We are really struggling to plot all of the points over cell outlines using Napari. We are hoping to try out FISHscale to visualize our 3D MERFISH data.

The .pkl write and read test code you provided works in the directory where I have stored the data, so it doesn't appear to be a permission issue.

I pulled and re-installed FISHscale, wrapped the strings for the filenames in Path objects to ensure there are no WindowsPath issues, and now the example_notebooks\FISHscale_tutorial_single_dataset.ipynb works. That is enough for us to move forward. I will ask the group to convert some data in the next week and give it a shot with our own data.
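For anyone hitting the same FileNotFoundError, here is a rough sketch of building the metadata path with pathlib and checking its length. The folder layout is inferred from the traceback above, not from the FISHscale source, and the helper names are hypothetical. The length check is only something that may be worth ruling out: the path in the first traceback is roughly 270 characters long, and classic Windows APIs limit paths to 260 characters unless long-path support is enabled. Nothing in this thread confirms that this was the cause.

```python
from pathlib import PureWindowsPath

# Classic Windows path limit; it includes the terminating null character.
WINDOWS_MAX_PATH = 260


def metadata_path(data_folder, dataset_name):
    """Build <data_folder>/<name>_FISHscale_Data/<name>_metadata.pkl.

    Assumption: this layout is read off the traceback in this thread.
    """
    folder = PureWindowsPath(data_folder) / f"{dataset_name}_FISHscale_Data"
    return folder / f"{dataset_name}_metadata.pkl"


def exceeds_max_path(path, limit=WINDOWS_MAX_PATH):
    """Return True if the path string does not fit within `limit`."""
    # One character of the limit is taken by the trailing null.
    return len(str(path)) > limit - 1


if __name__ == "__main__":
    p = metadata_path(r"C:\Users\me\Downloads", "sample")
    print(p, len(str(p)), exceeds_max_path(p))
```

PureWindowsPath is used so the sketch behaves the same on any operating system; on a Windows machine, Path would work just as well.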

For the example_notebooks\FISHscale_tutorial_multi_dataset.ipynb, I now get an error that is a bit more informative:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[5], line 4
      1 #Collect files
      2 color_dict = pickle.load(open(data_path / Path('Mouse_atlas_168_color_dict.pkl'), 'rb'))
----> 4 md = dataset.MultiDataset(str(data_path), 
      5                           x_label='c_px_microscope_stitched', 
      6                           y_label= 'r_px_microscope_stitched',
      7                           gene_label = 'decoded_genes', 
      8                           pixel_size='0.18 micrometer',
      9                           select_valid=True,
     10                           color_input=color_dict, 
     11                           reparse=True, 
     12                           unique_genes=None, 
     13                           verbose=True, 
     14                           exclude_genes=['Control1', 'Control2', 'Control3', 'Control4', 'Control5','Control6', 'Control7', 'Control8'],
     15                           z=[-140, 600, 1200, 1810, 2420, 3000, 3600])
     17 # Have olfactory bulb pointing left.
     18 for mdd in md.datasets:

File c:\users\dpshe\documents\github\fishscale\FISHscale\utils\dataset.py:686, in MultiDataset.__init__(self, data, data_folder, unique_genes, MultiDataset_name, color_input, verbose, grid_layout, columns_layout, x_label, y_label, z_label, gene_label, other_columns, exclude_genes, z, pixel_size, x_offset, y_offset, z_offset, polygon, select_valid, reparse, parse_num_threads)
    684     if parse_num_threads == -1 or parse_num_threads > self.cpu_count:
    685         parse_num_threads = self.cpu_count
--> 686     self.load_from_files(data, x_label, y_label, z_label, gene_label, other_columns, unique_genes, exclude_genes, z, 
    687                          pixel_size, x_offset, y_offset, z_offset, polygon, select_valid, reparse, color_input, 
    688                          parse_num_threads)
    689 else:
    690     raise Exception(f'Input for "data" not understood. Should be list with initiated Datasets or valid path to files.')

File c:\users\dpshe\documents\github\fishscale\FISHscale\utils\dataset.py:865, in MultiDataset.load_from_files(self, filepath, x_label, y_label, z_label, z, gene_label, other_columns, unique_genes, exclude_genes, pixel_size, x_offset, y_offset, z_offset, polygon, select_valid, reparse, color_input, num_threads)
    863         if not ug_success: 
    864             open_f = self._open_data_function(files[0])
--> 865             all_genes = open_f(files[0], [gene_label])
    866             self.unique_genes = np.unique(all_genes)
    868 #Open the files with the option to do this in paralell.

File c:\users\dpshe\documents\github\fishscale\FISHscale\utils\data_handling.py:62, in DataLoader_base._open_data_function.<locals>.open_f(f, columns)
     59     warnings.warn(f"""Could not open the following columns: {invalid_columns}. Ignore if not required.
     60                   Otherwise choose from: {existing_columns}""")
     61 if critical_columns != []:
---> 62     raise Exception(f'Could not open critical columns: {critical_columns}. Choose from: {existing_columns}')
     64 try:
     65     return pd.read_parquet(f, columns = valid_columns)

Exception: Could not open critical columns: [[]]. Choose from: ['fov_num', 'r_px_microscope_stitched', 'c_px_microscope_stitched', 'decoded_genes', 'Valid']

However, when I look into one of the parquet files, there is definitely data:

>>> import pandas as pd
>>> pd.read_parquet('211114_08_40_50_LBEXP20200925_EEL_Mouse_-140um_data_summary_simple_plotting_cleaned_microscope_stitched.parquet')
         fov_num  r_px_microscope_stitched  c_px_microscope_stitched decoded_genes  Valid
0              1              -1211.000000              52359.148535         Stmn2      0
1              1               -672.000000              52392.148535          Pax5      1
2              1               -498.000000              51522.148535         Calb2      1
3              1               -743.000000              51167.148535         Kcnj8      0
4              1               -900.000000              52630.148535          Aqp4      1
...          ...                       ...                       ...           ...    ...
3426472      757              37277.958549              62600.984457          Npnt      0
3426473      757              36305.958549              62989.984457          Npnt      0
3426474      757              37371.958549              62048.984457          Npnt      0
3426475      757              36614.958549              63277.984457          Npnt      0
3426476      757              36824.958549              62601.984457          Npnt      0

[3426477 rows x 5 columns]

Given that the single dataset example works, we should be good to go!

larsborm commented 1 year ago

Hi, Sorry about that. I accidentally introduced the critical column error. I have fixed it, could you please try again?
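The [[]] in the error message hints at how a check like this can misfire: a list that contains an empty list is not equal to an empty list, so a comparison against [] still passes. The sketch below illustrates the effect and one way to guard against it by flattening first. It is not the actual FISHscale fix, and the function names are made up for illustration.

```python
def flatten(columns):
    """Flatten one level of nesting: ['a', ['b'], []] -> ['a', 'b']."""
    flat = []
    for item in columns:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def missing_critical(requested, existing):
    """Return the requested column names that are absent from `existing`."""
    return [c for c in flatten(requested) if c not in existing]


if __name__ == "__main__":
    existing = ["fov_num", "r_px_microscope_stitched",
                "c_px_microscope_stitched", "decoded_genes", "Valid"]

    # A nested empty list is "not empty", so a check against [] fires.
    print([[]] != [])

    # After flattening, nothing is reported as missing.
    print(missing_critical([[]], existing))
```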

The paths puzzle me, though, because the path in your first message seems properly formatted for Windows. Let me know if it remains an issue.

Open3D is fantastic for large datasets. It is really fast, and I hope you like it. All credit to @Mossi8 for building the awesome viewer in FISHscale! Cell outlines will unfortunately not work, but there is an Open3D line class, so they could possibly be added. Please let me know if everything works now.

dpshepherd commented 1 year ago

Just getting back to testing this. Do you have a tutorial for correctly importing a CSV file?

We are running into an error when we attempt to load a .csv instead of a .parquet. The error appears to happen at this line: https://github.com/linnarsson-lab/FISHscale/blob/caefc60b925d137c90e44868af4b163fe95c7d6c/FISHscale/utils/data_handling.py#L493

The CSV file has columns of x, y, z, gene_id that we provide to the loader via:

d = dataset.Dataset(file_name,
                     x_label = 'x',
                     y_label = 'y',
                     z_label = 'z',              
                     gene_label = 'gene_id',
                     pixel_size = '1 micrometer',
                     unique_genes=None, 
                     verbose = True,
                     reparse = True)

The xyz coordinates are already in micrometers in the CSV file.
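Before handing a CSV to the loader, it can help to confirm that the header really contains the column names passed as labels. The following is a minimal standard-library sketch; the function name is hypothetical and the check is independent of FISHscale.

```python
import csv


def missing_csv_columns(csv_path, required):
    """Return the names in `required` that are not in the CSV header row."""
    with open(csv_path, newline="") as fh:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(fh), [])
    header = [name.strip() for name in header]
    return [name for name in required if name not in header]
```

Calling missing_csv_columns(file_name, ['x', 'y', 'z', 'gene_id']) should return an empty list when the header matches the labels given to dataset.Dataset.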

The error is:

Traceback (most recent call last):
  File "C:\Users\qi2lab\miniconda3\envs\fishscale\lib\site-packages\dask\dataframe\groupby.py", line 2856, in __getattr__     
    return self[key]
  File "C:\Users\qi2lab\miniconda3\envs\fishscale\lib\site-packages\dask\dataframe\groupby.py", line 2842, in __getitem__     
    g._meta = g._meta[key]
  File "C:\Users\qi2lab\miniconda3\envs\fishscale\lib\site-packages\pandas\core\groupby\generic.py", line 1771, in __getitem__
    return super().__getitem__(key)
  File "C:\Users\qi2lab\miniconda3\envs\fishscale\lib\site-packages\pandas\core\base.py", line 244, in __getitem__
    raise KeyError(f"Column not found: {key}")
KeyError: 'Column not found: progress_apply'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:/Users/qi2lab/Documents/GitHub/FISHscale/example_notebooks/test_import.py", line 21, in <module>
    d = dataset.Dataset(file_name,
  File "c:\users\qi2lab\documents\github\fishscale\FISHscale\utils\dataset.py", line 203, in __init__
    self.load_data(self.filename, x_label, y_label, gene_label, self.other_columns, x_offset, y_offset, z_offset,
  File "c:\users\qi2lab\documents\github\fishscale\FISHscale\utils\data_handling.py", line 494, in load_data
    data.groupby('g').progress_apply(lambda x: self._dump_to_parquet(x, self.dataset_name, self.FISHscale_data_folder))#, meta=('float64')).compute()
  File "C:\Users\qi2lab\miniconda3\envs\fishscale\lib\site-packages\dask\dataframe\groupby.py", line 2858, in __getattr__
    raise AttributeError(e) from e
AttributeError: 'Column not found: progress_apply'

Thanks!
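A note on why the message reads "Column not found": progress_apply is a method that tqdm adds to pandas objects, while the traceback shows the groupby object here came from dask. Judging by the two frames in dask's groupby.py, an unknown attribute is looked up as a column name, and the resulting KeyError is re-raised as an AttributeError. The toy class below mimics that pattern to show where the wording comes from. It is not dask's code.

```python
class ToyGroupBy:
    """Toy stand-in that treats unknown attributes as column lookups."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, key):
        if key not in self._columns:
            raise KeyError(f"Column not found: {key}")
        return f"<column {key}>"

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(e) from e


if __name__ == "__main__":
    g = ToyGroupBy(["x", "y", "g"])
    print(g.x)  # attribute access resolves to a column
    try:
        g.progress_apply  # not a column, not a method
    except AttributeError as err:
        print("AttributeError:", err)
```

So the error does not mean a column is missing from the CSV; it means the method was requested on an object that does not provide it.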

larsborm commented 1 year ago

Hi Doug, The problem should be fixed. I'm crossing my fingers that everything works for you now!

dpshepherd commented 1 year ago

Success! This is a few millimeters square of human lung tissue with 30 micron depth.

[attached image]

There are some oddities when browsing in 3D, but this may be related to the fact that our coordinates are not centered around (0,0,0). We will center the spots and try again.

Thank you!! I anticipate the group, especially @AlexCoul, will make heavy use of this.

larsborm commented 1 year ago

Fantastic! That looks great! To center the dataset you can use:

x_center, y_center, z_center = d.xyz_center
d.offset_data_temp(x_offset = -x_center, y_offset = -y_center, z_offset = -z_center)

I made some improvements for the Z dimension so please pull again. Just note that this centering does not change the files on disk.
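For readers who want to center coordinates before loading, or just to see what the offset does, here is a small NumPy sketch. It takes the center as the midpoint of the bounding box, which is an assumption: this thread does not say how xyz_center is defined. The function name is made up for illustration.

```python
import numpy as np


def center_points(xyz):
    """Shift points so the midpoint of their bounding box sits at the origin.

    Returns the shifted copy and the center that was subtracted.
    """
    xyz = np.asarray(xyz, dtype=float)
    # Assumption: "center" means the bounding-box midpoint per axis.
    center = (xyz.min(axis=0) + xyz.max(axis=0)) / 2.0
    return xyz - center, center


if __name__ == "__main__":
    points = [[100.0, 200.0, 0.0], [300.0, 600.0, 30.0]]
    centered, center = center_points(points)
    print(center)
    print(centered)
```

Like the temporary offset above, this returns a shifted copy and leaves the input untouched.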

Mossi8 commented 1 year ago

Great to see it working!!

If you are going to use it, maybe I can explain a few things:

The scroll bar for the genes behaves oddly once you scroll past the halfway point of your gene list (on very long gene lists you can't see the bottom of the list). That is an Open3D issue that I worked around by adding the arrow on the search bar; if you hide the search bar, the scroll bar issue is gone.

The second hidden feature is simply that you can use the keyboard arrow keys to change genes. If that does not work, click anywhere on the visualiser and try the arrows again.

Anything else, let us know.

nhuytan commented 1 month ago

I have a problem when loading a dataset.

[two attached images]

Any suggestions on how to fix this bug?