kharchenkolab / Baysor

Bayesian Segmentation of Spatial Transcriptomics Data
https://kharchenkolab.github.io/Baysor/
MIT License
152 stars 31 forks source link

Running 3D Segmentation #111

Closed NickDiNapoli closed 3 weeks ago

NickDiNapoli commented 8 months ago

I am trying to obtain a 3D segmentation result from what I believe is the way to do so. I am not using PRIOR_SEGMENTATION in this run.

  1. Specify z column name in the MOLECULES_CSV via '-z' parameter
  2. Set force_2d = false in the config.toml file

The README documentation explains:

_segmentationpolygons.json: polygons used for visualization in GeoJSON format. In the case of 3D segmentation, it is an array of GeoJSON polygons per z-plane, as well as "joint" polygons. Shown only if --save-polygons=geojson is set.

The output GeoJSON only contains one set of 2D geometries instead of the expected output above. Can you please advise what aspect I am missing to obtain a 3D output?

Run parameters: ~/.julia/bin/baysor run \ 'transcripts.csv' \ -c 'example_config_3d.toml' \ -x 'global_x' -y 'global_y' -z 'global_z' -g 'gene' -m 20 -s 5 \ --save-polygons=geojson \ -p \ -o 'test'

Environment is: NAME="Amazon Linux AMI" VERSION="2018.03"

FalkoHof commented 7 months ago

I am running into a similar issue, which I highlighted in a bit more detail in this other issue here.

NickDiNapoli commented 7 months ago

@FalkoHof Thanks so much for linking your issue. It seems like your run got to the point where 3D segmentation was at least recognized. Out of curiosity, what did you do differently to at least get to that point, even if your run ultimately errored out because of the too many z values issue?

FalkoHof commented 7 months ago

The 3D segmentation was working for me irrespectively of using floats or ints on the z-axis. However, the json was only outputing 2D data. What kind of data are you using as input? I used Xenium data and followed this workflow (but deviated a bit from it due to changes in baysor and the xenium output). The modifications I made to the workflow were mostly to the main method of the filter_transcripts.py. script.

See below:

def main():
    # Parse input arguments.
    args = parse_args()

    data_frame = pd.read_csv(args.transcript)

    # Filter transcripts. Ignore negative controls
    filtered_frame = data_frame[(data_frame["qv"] >= args.min_qv) &
                                (data_frame["x_location"] >= args.min_x) &
                                (data_frame["x_location"] <= args.max_x) &
                                (data_frame["y_location"] >= args.min_y) &
                                (data_frame["y_location"] <= args.max_y) &
                                (~data_frame["feature_name"].str.startswith("NegControlProbe_")) &
                                (~data_frame["feature_name"].str.startswith("antisense_")) &
                                (~data_frame["feature_name"].str.startswith("NegControlCodeword_")) &
                                (~data_frame["feature_name"].str.startswith("BLANK_"))]
    # Xenium takes a lot of z-stacks and outputs a continous value for the z-location.Baysor wants a discrete z-stack value though.
    # As stacks are taken in 0.75 um intervals, we just round it to the next multiple of 0.75.
    filtered_frame["z_stack"] = np.rint(filtered_frame.loc[:,"z_location"] / 0.75).astype(np.int32)
    # Add also an additional column where we just round the z-value to the next int
    filtered_frame["z_location_int"] = np.rint(filtered_frame.loc[:,"z_location"]).astype(np.int32)

    # Baysor requires each cell id to be an integer. Newer Xenium ranger versions produce a letter based cell id though.
    # so we map the letter based cell ids to an integer range. Transcripts not assigned to a cell are labled "UNASSIGNED" in newer Xenium output and will be mapped to 0
    cell_ids = np.unique(filtered_frame.loc[:,"cell_id"])
    cell_ids_ints = np.arange(0,len(cell_ids))
    df_cell_ids = pd.DataFrame({"cell_id" : cell_ids, "cell_id_int" : cell_ids_ints}).set_index("cell_id")
    # here we just join the new integer cell ids into the filtered data frame
    filtered_frame = filtered_frame.join(df_cell_ids, how="left", on="cell_id")

    # Change cell_id of cell-free transcripts from "UNASSIGNED" to 0
    neg_cell_row = filtered_frame["cell_id"] == "UNASSIGNED"
    # add an additional column to inducate if the transcript overlaps the cell
    filtered_frame["overlaps_cell"] = 1
    filtered_frame.loc[neg_cell_row,"overlaps_cell"] = 0
    # also add another column if the transcript overlaps with the nucleus. As one cell contains only one nucleus in the Xenium segmentation we can use the cell int id also as nucleus id
    # We do this so we can use also just the nuclei segementations as input for baysor if we like (see https://github.com/kharchenkolab/Baysor?tab=readme-ov-file#using-a-prior-segmentation)
    filtered_frame["nucleus_id_int"] = filtered_frame.loc[:,"cell_id_int"]
    filtered_frame.loc[data_frame["overlaps_nucleus"] == 0, "nucleus_id_int"] = 0

    # Output filtered transcripts to CSV
    filtered_frame.to_csv('_'.join(["X"+str(args.min_x)+"-"+str(args.max_x), "Y"+str(args.min_y)+"-"+str(args.max_y), "filtered_transcripts.csv"]),
                          index=False,
                          encoding = 'utf-8')

@NickDiNapoli What is the error baysor is throwing in your case?

VPetukhov commented 2 months ago

Apologies for the late reply. This is a bug with polygon outputs. The segmentation is still done in 3D, but the polygons are not estimated properly. It will be fixed in the next release.

VPetukhov commented 3 weeks ago

This is fixed in v0.7.0. Please, see the changelog for more details.