Data Preparation - Githubissues

jw235 commented 9 months ago

Hi,

I've been exploring your repository and noticed that you've successfully used Xenium datasets in your library. I'm currently working with similar Xenium files and am interested in replicating your approach for data processing and segmentation.

Could you elaborate on how you prepared the Xenium data for use with Bering? Specifically, I'm facing a challenge in determining which files to utilize among the various H5, matrix, and CSV files (especially those that are zipped) for creating segmented and unsegmented data.

Understanding your method of data preparation would be immensely helpful, as I'm currently unsure which files to select for my project. Any insights or guidance you could provide would be greatly appreciated.

Thank you for your time and assistance!

KANG-BIOINFO commented 9 months ago

Hi,

Thank you for your interest in Bering. For Xenium data, we used the "transcripts.csv.gz" under the folder "outs" as the input, which should contain raw information about the 2d/3d spatial locations and gene identities of individual transcripts.

Bering requires only a few columns. You can refer to the tutorials "Understand Bering Object" and "Analyze Xenium Breast Cancer Data" for more details.

Thanks, Kang
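
For readers following along, here is a minimal sketch of loading "transcripts.csv.gz" with the Python standard library. It is not taken from the Bering code base. The column names (`feature_name`, `x_location`, `y_location`, `z_location`, `qv`) are assumed from the usual Xenium output layout and should be checked against your own file; the function name `load_transcripts`, the output keys, and the quality threshold of 20 are illustrative choices, not Bering requirements.

```python
import csv
import gzip

# Column names assumed from the Xenium "transcripts.csv.gz" layout;
# verify them against the header of your own file.
GENE_COL, X_COL, Y_COL, Z_COL, QV_COL = (
    "feature_name", "x_location", "y_location", "z_location", "qv")


def load_transcripts(path, min_qv=20.0):
    """Read a gzipped transcript table and keep gene identity plus coordinates."""
    rows = []
    with gzip.open(path, mode="rt", newline="") as handle:
        for record in csv.DictReader(handle):
            # Skip low-confidence calls; the default threshold is illustrative.
            if float(record[QV_COL]) < min_qv:
                continue
            rows.append({
                "gene": record[GENE_COL],
                "x": float(record[X_COL]),
                "y": float(record[Y_COL]),
                "z": float(record[Z_COL]),
            })
    return rows
```

Calling `load_transcripts("outs/transcripts.csv.gz")` would return one dictionary per transcript. The keys would still need to be renamed to whatever column names the Bering tutorials specify.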

jw235 commented 9 months ago

Thanks for your quick reply. I have a couple more questions:

1. In the tutorial, different data are used, both segmented and unsegmented. However, in the original Xenium file, there's no information on segmentation or labels, which the Bering function obviously requires. Is there a standard method for creating segmentation and labels from the file? How does this work if one is not using tumor cells?
2. Which of the three TIF files is actually used?

Thanks a lot for your answer.

KANG-BIOINFO commented 9 months ago

Those are good questions!

First, since the algorithm is based on supervised learning, we need to provide labels for some cells. For brand-new data, I select a small region, run a simple watershed algorithm, and manually annotate the segmented cells. In the beginning, coarse cell labels are fine. After running a round of Bering on this small set of input cells, you get more cells segmented and annotated. This process can be repeated iteratively to improve segmentation and annotation performance.
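
As an illustration of the step after the watershed, the sketch below splits transcripts into segmented and unsegmented sets by looking each one up in an integer label mask. This is an assumed workflow, not the authors' code: the function name `split_by_label_mask`, the `segmented` key, and the use of -1 for unsegmented transcripts are hypothetical conventions, and the mask is assumed to use 0 for background with the same origin as the transcript coordinates.

```python
def split_by_label_mask(transcripts, label_mask, pixel_size=1.0):
    """Split transcripts into (segmented, unsegmented) lists.

    label_mask[row][col] holds an integer cell ID, with 0 meaning background.
    pixel_size converts transcript coordinates into mask pixels.
    """
    segmented, unsegmented = [], []
    n_rows, n_cols = len(label_mask), len(label_mask[0])
    for t in transcripts:
        col = int(t["x"] // pixel_size)
        row = int(t["y"] // pixel_size)
        inside = 0 <= row < n_rows and 0 <= col < n_cols
        cell_id = label_mask[row][col] if inside else 0
        if cell_id > 0:
            segmented.append({**t, "segmented": cell_id})
        else:
            # -1 marks "no cell" here; this convention is an assumption.
            unsegmented.append({**t, "segmented": -1})
    return segmented, unsegmented
```

Transcripts that fall on background pixels or outside the annotated region end up in the unsegmented list, which is one simple way to obtain both inputs from a single small region.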

For the TIF files, there is no constraint on which ones to use. You can input just DAPI, or DAPI plus other channels. This step may be a little slow, and we are working on speeding it up.

jw235 commented 8 months ago

Thanks for your reply. I got a few steps further in the meantime and have two more questions.

I used QuPath to define my labels, exported them as GeoJSON, and ran a watershed transform to fill the segmented column. My dataframe has the same structure as the Janesick dataset. Creating the Bering object failed when I used all 22k rows (probably because of duplicates in the feature names), but it worked when I used only the head of the dataframe. For the unsegmented dataframe, I used 5 features that I defined as background.

How did you define the unsegmented dataframe?
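
For reference, the QuPath step described above could be sketched roughly as follows: read polygon annotations from a GeoJSON export and assign each transcript to the first polygon that contains it. This is a generic point-in-polygon (ray casting) sketch, not Bering's method. The helper names `point_in_ring` and `assign_cells` are hypothetical, only the outer ring of each `Polygon` geometry is used (holes and other geometry types are ignored), and both coordinate sets are assumed to be in the same units. If the annotations are in pixels and the transcripts in microns, one of them must be rescaled first.

```python
import json


def point_in_ring(x, y, ring):
    """Ray-casting test: is (x, y) inside the closed ring of [x, y] vertices?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Count edges that a horizontal ray from (x, y) to +infinity crosses.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def assign_cells(transcripts, geojson_text):
    """Return a 1-based polygon index per transcript, or -1 if outside all."""
    data = json.loads(geojson_text)
    # Accept either a FeatureCollection or a bare list of features.
    features = data["features"] if isinstance(data, dict) else data
    rings = [f["geometry"]["coordinates"][0] for f in features
             if f["geometry"]["type"] == "Polygon"]
    assigned = []
    for t in transcripts:
        hit = -1
        for index, ring in enumerate(rings, start=1):
            if point_in_ring(t["x"], t["y"], ring):
                hit = index
                break
        assigned.append(hit)
    return assigned
```

Transcripts that receive -1 lie outside every annotated polygon and are natural candidates for the unsegmented set, although whether that matches Bering's expectations is not confirmed in this thread.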

Besides that, I tried to run the complete code and got an error that I am not able to solve by myself.

Do you have any idea what the error means or how to solve it?

AttributeError                            Traceback (most recent call last)
Cell In[36], line 2
      1 # Build graphs for GCN training purpose
----> 2 br.graphs.BuildWindowGraphs(
      3     bg,
      4     n_cells_perClass = 12,
      5     window_width = 15.0,
      6     window_height = 15.0,
      7     n_neighbors = 30,
      8 )
     10 br.graphs.CreateData(
     11     bg,
     12     batch_size = 16,
     13     training_ratio = 0.8,
     14 )
     16 br.train.Training(bg)

File /usr/local/lib/python3.10/dist-packages/Bering/graphs/_loader.py:151, in BuildWindowGraphs(bg, n_cells_perClass, window_width, window_height, n_neighbors, min_points, use_unsegmented_ratio, max_unsegmented_thresh, cell_percentile_from_border, window_shift_ratio, n_windows_per_cell, min_spots_outside, **kwargs)
    148 logger.info(f'Average number of filtered neighbors: {avg_neighbors:.2f} in the window')
    150 bg.Graphs_golden = Graphs
--> 151 logger.info(f'Number of node features: {bg.n_node_features}')
    152 logger.info(f'\nTotal number of golden-truth graphs is {len(bg.Graphs_golden)}')

AttributeError: 'Bering_Graph' object has no attribute 'n_node_features'

cstrlln commented 4 months ago

@jw235 Did you manage to make this work?

jw235 commented 4 months ago

Unfortunately not. Hopefully I will find time to try again in the next few weeks.

jian-shu-lab / Bering

Data Preparation #16