nolanlab / spade

SPADE: Spanning Tree Progression of Density Normalized Events

Using converted FCS files #128

Closed gdreiman1 closed 8 years ago

gdreiman1 commented 8 years ago

I'm trying to use SPADE on a small data set derived from a single-cell barcode chip. The data set contains 112 cells with 26 marker values each. Our data is natively in Excel format, so we converted it to FCS with some MATLAB code. I have been using the following settings to try to account for our small data size:

files = "input_dirdrug2"
file_pattern = "*.fcs"
out_dir = "output_dirdrug2"
cluster_cols = NULL
panels = NULL
comp = TRUE
transforms = "transforms"
downsampling_target_number = NULL
downsampling_target_pctile = NULL
downsampling_target_percent = 1
downsampling_exclude_pctile = 0.01
k = 10
clustering_samples = 100
layout = layout
pctile_color = c(0.02, 0.98)

However, when I run SPADE.driver, I get the following errors:

SPADE.driver(files='input_dirdrug2', out_dir='output_dirdrug2', transforms='transforms')
the text section does not end with delimiter: \\. The last keyword is dropped.
the text section does not end with delimiter: \\. The last keyword is dropped.
Downsampling file: input_dirdrug2/drug.fcs
the text section does not end with delimiter: \\. The last keyword is dropped.
the text section does not end with delimiter: \\. The last keyword is dropped.
  Estimated downsampling-I progress:  0% ...
  Estimated downsampling-I progress: 100% ...
the text section does not end with delimiter: \\. The last keyword is dropped.
the text section does not end with delimiter: \\. The last keyword is dropped.
Targeting 11 events for output_dirdrug2/drug.fcs.density.fcs
Clustering files...
Error in SPADE.cluster(SPADE.transform.matrix(data, transforms), k) : 
  Number of requested clusters exceeds number of events

I think there are two problems here. First, the FCS file is not properly formatted so SPADE.driver cannot read the entire file. Second, the data set is small so I need to be careful about choosing my k and clustering_samples values.

I have a few questions about these two problems. Is it possible to use other file formats besides FCS? If not, what is the best way to convert to FCS so that I get the proper file formatting? What would be appropriate values to use with a small data set like mine?

zbjornson commented 8 years ago

Some of the discussion in #122 might be helpful to you.

> Is it possible to use other file formats besides FCS?

While you could modify the package to do so, it is by far easier to convert your data into FCS format.

> what is the best way to convert to FCS so that I get the proper file formatting?

You could put something together in R, Mathematica or MATLAB fairly easily (although I'm not sure why MATLAB created an invalid file). If you want to try a task-specific package, look at CsvToFcs from the Broad Institute's GenePattern Flow Cytometry Module: http://software.broadinstitute.org/cancer/software/genepattern/flow-cytometry-data-preprocessing You can even run it for free from the website once you have registered.

> What would be appropriate values to use with a small data set like mine?

112 cells can easily be processed without downsampling. In your call to SPADE.driver, set downsampling_target_percent=0.99 and it will hopefully work.
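As a rough sketch of how those settings might be passed explicitly, assuming the argument names listed in the original post are accepted by SPADE.driver as written. The call echoed in the log shows only files, out_dir and transforms, and "Targeting 11 events" is roughly 10% of 112, so it may be worth checking that k and the downsampling values actually reach the function. The `spade_args` list, `n_events`, and the guard around the call are illustrative names, not part of the package.

```r
# Settings for a 112-event data set, passed explicitly by name.
# Argument names are taken from the original post; transforms, cluster_cols
# and the other arguments are omitted here and should be added as in your own call.
spade_args <- list(
  files = "input_dirdrug2",
  out_dir = "output_dirdrug2",
  k = 10,                              # requested number of clusters
  clustering_samples = 100,
  downsampling_target_percent = 0.99   # value suggested above
)

# The clustering step fails when more clusters are requested than events exist,
# so check k against the size of the data set before running anything.
n_events <- 112
stopifnot(spade_args$k <= n_events)

# Only run when the spade package and the input directory are available.
if (requireNamespace("spade", quietly = TRUE) && dir.exists(spade_args$files)) {
  do.call(spade::SPADE.driver, spade_args)
}
```

Building the argument list first makes it easy to print and confirm exactly what is handed to the driver.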

All that said, SPADE without the downsampling is standard hierarchical clustering followed by layout in an MST. Getting the downsampling to "turn off" is sort of tricky. While you're logged in to GenePattern, you could check out their clustering modules to see if you find something interesting, or run hclust in R or an equivalent in your favorite language.

gdreiman1 commented 8 years ago

I have looked at #122 in the past; it seems that I am having similar problems.

> While you could modify the package to do so, it is by far easier to convert your data into FCS format. You could put something together in R, Mathematica or MATLAB fairly easily (although I'm not sure why MATLAB created an invalid file). If you want to try a task-specific package, look at CsvToFcs from the Broad Institute's GenePattern Flow Cytometry Module: http://software.broadinstitute.org/cancer/software/genepattern/flow-cytometry-data-preprocessing You can even run it for free from the website once you have registered.

I've actually tried this before (along with several other programs) without much success. I just used the Broad Institute program to convert a sample CSV of our data, and got the following error once I ran SPADE on the resulting FCS file.

SPADE.driver(files='input_dirdrug2', out_dir='output_dirdrug2', transforms='transforms')
the text section does not end with delimiter: \|. The last keyword is dropped.
the text section does not end with delimiter: \|. The last keyword is dropped.
Error in readFCSgetPar(x, "$BYTEORD") : 
  Parameter(s) $BYTEORD not contained in 'x'

It seems like maybe I'm not formatting the initial CSV file correctly, which is yielding an improper FCS file. This is strange because I have been able to run other packages that take FCS files using my converted files. Do you have any suggestions for fixing this? Is there a way that I can view the layout of the sample data provided in the Nature Protocols paper so that I can copy that in my converted files? I'm not sure what the proper program for viewing and editing FCS files would be.
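One way to look inside a converted file without a dedicated viewer is to read its HEADER and TEXT segments directly. The sketch below uses base R only and relies on the FCS layout in which the first 58 bytes hold the version string and the byte offsets of each segment, and the first character of the TEXT segment is the keyword delimiter. `inspect_fcs_text` is a hypothetical helper written for this thread, not a SPADE or flowCore function, and it does not handle escaped (doubled) delimiters inside keyword values.

```r
# Read the HEADER and TEXT segments of an FCS file and report the two things
# the errors above complain about: the trailing delimiter and $BYTEORD.
inspect_fcs_text <- function(path) {
  bytes <- readBin(path, "raw", n = file.size(path))

  # HEADER: 6-byte version, 4 spaces, then 8-byte ASCII offsets (0-based).
  header <- rawToChar(bytes[1:58])
  version <- substr(header, 1, 6)
  text_start <- as.integer(substr(header, 11, 18))
  text_end <- as.integer(substr(header, 19, 26))

  # TEXT: delimiter, then keyword/value pairs separated by that delimiter.
  text <- bytes[(text_start + 1):(text_end + 1)]
  delim <- rawToChar(text[1])
  fields <- strsplit(rawToChar(text[-1]), delim, fixed = TRUE)[[1]]
  keys <- fields[c(TRUE, FALSE)]   # odd positions are keyword names

  list(
    version = version,
    delimiter = delim,
    ends_with_delimiter = text[length(text)] == text[1],
    keywords = keys,
    has_byteord = "$BYTEORD" %in% keys
  )
}

# Example (path taken from the log above):
# str(inspect_fcs_text("input_dirdrug2/drug.fcs"))
```

Running the same helper on a file that SPADE accepts and on a converted file should show where the two differ.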

> All that said, SPADE without the downsampling is standard hierarchical clustering followed by layout in an MST. Getting the downsampling to "turn off" is sort of tricky. While you're logged in to GenePattern, you could check out their clustering modules to see if you find something interesting, or run hclust in R or an equivalent in your favorite language.

We are interested in the MST portion of SPADE. Would it be relatively easy to run igraph's minimum.spanning.tree function on the output of hclust to mimic SPADE's output? Or is there another way to yield similar graphics without involving FCS files?

SamGG commented 8 years ago

Hi, I don't quite see your goal in using SPADE. It seems to me that you have only ONE data table of 112 x 26 values. If so, I think you would do better to try a standard hierarchical clustering, PCA, MDS, or a supervised analysis such as SAM or limma. If your aim is to display a phylogeny-like tree from a clustering algorithm or a distance matrix, I think the vegan package could help (http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/spantree.html). The igraph package you mentioned is also helpful, as is the ape package. I think you could find an answer that way more quickly than by converting your table to run SPADE. If you are still interested in SPADE, I think you should start by looking at https://github.com/nolanlab/spade/blob/master/R/driver.R#L113-L139 and https://github.com/nolanlab/spade/blob/master/R/cluster.R for the computation part of the algorithm. If the display part is also important, look at the driver.R file. Alternatively, once you have a tree (hclust, MST...) or a distance matrix, you could try to export it with node attributes and use Cytoscape. HTH
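For the hclust-plus-MST route discussed above, a minimal sketch might look like the following. It is not SPADE's own code: the random `expr` matrix is a stand-in for the real 112 x 26 table, and average linkage, k = 10 and the use of per-cluster medians as tree nodes are assumptions made for illustration. It needs the igraph package.

```r
library(igraph)

# Hypothetical stand-in for the 112 x 26 table; replace with your own data.
set.seed(1)
expr <- matrix(rnorm(112 * 26), nrow = 112,
               dimnames = list(NULL, paste0("marker", 1:26)))

# 1. Agglomerative hierarchical clustering, cut into k groups.
k <- 10
hc <- hclust(dist(expr), method = "average")
clusters <- cutree(hc, k = k)

# 2. Median marker values per cluster (a k x 26 matrix).
medians <- apply(expr, 2, function(marker) tapply(marker, clusters, median))

# 3. Minimum spanning tree over the distances between cluster medians.
d <- as.matrix(dist(medians))
g <- graph_from_adjacency_matrix(d, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
tree <- mst(g)  # uses the "weight" edge attribute set above

# Node size reflects how many cells fall in each cluster.
plot(tree, vertex.size = 5 + 2 * sqrt(as.vector(table(clusters))))
```

The resulting `tree` object can also be written out with igraph's `write_graph` (for example in "graphml" format) for use in Cytoscape.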

gdreiman1 commented 8 years ago

@SamGG Thanks for the suggestions! I'll look into the vegan package; it seems like it might be a much simpler solution.

Also, for future reference, I did resolve the CSV to FCS file conversion issue. In order to make the FCS file work with SPADE, you need to generate a CSV file with column labels but no row labels. That CSV can be converted to FCS with the Broad Institute's GenePattern Flow Cytometry Module and the resulting file runs in SPADE with no errors.

zbjornson commented 8 years ago

Thanks @SamGG for the input.

@gdreiman1 I'll close this issue as it sounds like you have a path forward. Feel free to open another issue or comment here and I'll re-open it.

SamGG commented 8 years ago

@gdreiman1 Thanks for the tip. With write.csv you usually need to suppress the row names by passing row.names = FALSE. @zbjornson thanks for taking care of issues.
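A small sketch of that tip, using a made-up table in place of the real data (the `expr` data frame, the marker names and the temporary file path are all placeholders):

```r
# Hypothetical 112 x 26 table standing in for the data imported from Excel.
set.seed(1)
expr <- as.data.frame(matrix(runif(112 * 26), nrow = 112))
names(expr) <- paste0("marker", 1:26)

csv_path <- file.path(tempdir(), "drug.csv")

# row.names = FALSE drops the leading column of row labels;
# the header row with the column labels is still written.
write.csv(expr, csv_path, row.names = FALSE)

# Read the file back to confirm its shape: 112 rows, 26 marker columns.
check <- read.csv(csv_path)
stopifnot(nrow(check) == 112, ncol(check) == 26)
```

The CSV written this way has column labels but no row labels, which is the layout reported above to convert cleanly to FCS.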