linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Export selected genes to CSV #138

Open JobLeonard opened 6 years ago

JobLeonard commented 6 years ago

So this turns out to be quite trivial from an export code point of view, actually. Adding the interface buttons is probably going to be more work!

The bigger question is what data to include, aside from the gene expressions. I figure

But maybe it would also make sense to include CellID and Accession. That would result in something like:

[label attribute] [labelval1] [labelval2] [labelval3] [...]
CellId [cellid1] [cellid2] [cellid3] [...]
Gene Accession
[gene1] [accession1] 0.0 0.3 0.0 ...
[gene2] [accession2] 0.1 0.4 0.0 ...
[gene3] [accession3] 0.0 0.0 0.0 ...
[...] [...] ... ... ...

Is this a logical format? @slinnarsson, @simone-codeluppi, @pl-ki, @gioelelm.
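A minimal sketch of how the wide layout above could be written out, in Python purely for illustration (the viewer itself is client-side JavaScript). The function name `write_wide`, the tab delimiter, and the empty padding cell in the two header rows are assumptions; the proposal above does not pin down how the header rows line up with the Gene/Accession columns.

```python
import csv
import io


def write_wide(genes, accessions, cell_ids, label_name, labels, matrix):
    """Return the wide (matrix-style) layout as tab-separated text.

    matrix[i][j] is the expression of gene i in cell j.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Two header rows with column attributes. The empty second field is an
    # assumed padding cell so values align with the expression columns.
    writer.writerow([label_name, ""] + list(labels))
    writer.writerow(["CellId", ""] + list(cell_ids))
    # Header for the two row-attribute columns.
    writer.writerow(["Gene", "Accession"])
    # One line per gene: name, accession, then one value per cell.
    for gene, accession, row in zip(genes, accessions, matrix):
        writer.writerow([gene, accession] + list(row))
    return out.getvalue()
```

Any parser for this layout has to know how many header rows precede the gene rows, which is the main cost of the matrix style.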

mschilli87 commented 6 years ago

This sounds like a great addition to the viewer.

For what it's worth, I'd rather like to see the data in long table format (each column corresponding to an observable and each row to an observation):

Gene Accession CellID [label attribute] expression
[gene1] [accession1] [cellid1] [labelval1] 0.0
[gene1] [accession1] [cellid2] [labelval2] 0.3
[gene1] [accession1] [cellid3] [labelval3] 0.0
[gene1] [accession1] [...] [...] ...
[gene2] [accession2] [cellid1] [labelval1] 0.1
[gene2] [accession2] [cellid2] [labelval2] 0.4
[gene2] [accession2] [cellid3] [labelval3] 0.0
[gene2] [accession2] [...] [...] ...
[gene3] [accession3] [cellid1] [labelval1] 0.0
[gene3] [accession3] [cellid2] [labelval2] 0.0
[gene3] [accession3] [cellid3] [labelval3] 0.0
[gene3] [accession3] [...] [...] ...
[...] [...] [...] [...] ...

While this drops the matrix style, it is easy to extract the same information, and in my experience parsing this kind of data is easier with many tools. It's also easier to modify (extend) the format later on (by adding extra columns at the end) without breaking existing code. (Just imagine you want to include another attribute: your format would need an extra line that parsers would need to be aware of.)

Also, while some information is repeated (CellIDs, labelvals), TSV compresses nicely, so I don't see any drawbacks.

Obviously, any kind of TSV export would be great and I could transform the data myself if you disagree.
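To get a feel for the compression point above, here is a rough sketch that builds a small synthetic long-format table and gzips it. All names and values are made up for illustration, and the ratio on real data will depend on the actual labels and expression values.

```python
import gzip


def long_tsv(n_genes, n_cells):
    """Build a synthetic long-format table (made-up names and values)."""
    lines = ["Gene\tAccession\tCellID\tLabel\texpression"]
    for g in range(n_genes):
        for c in range(n_cells):
            value = ((g + 1) * c % 10) / 10
            lines.append(f"gene{g}\tacc{g}\tcell{c}\tcluster{c % 5}\t{value}")
    return "\n".join(lines).encode("utf-8")


raw = long_tsv(n_genes=3, n_cells=1000)
packed = gzip.compress(raw)
# Compare the raw and compressed sizes in bytes.
print(len(raw), len(packed))
```

The repeated gene, accession and label strings are exactly the kind of redundancy a general-purpose compressor removes.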

JobLeonard commented 6 years ago

I specifically opened the issue because I had no idea what the more logical export form would be, so your input is greatly appreciated! :)

If I understand correctly, each [expression] column would be a gene?

I'll await some more feedback before deciding how to implement this

mschilli87 commented 6 years ago

> If I understand correctly, each [expression] column would be a gene?

No, there would be just one expression column.

In my example above, I sorted the table by gene. Sorted by cell, it would look like this:

Gene Accession CellID [label attribute] expression
[gene1] [accession1] [cellid1] [labelval1] 0.0
[gene2] [accession2] [cellid1] [labelval1] 0.1
[gene3] [accession3] [cellid1] [labelval1] 0.0
[...] [...] [cellid1] [labelval1] ...
[gene1] [accession1] [cellid2] [labelval2] 0.3
[gene2] [accession2] [cellid2] [labelval2] 0.4
[gene3] [accession3] [cellid2] [labelval2] 0.0
[...] [...] [cellid2] [labelval2] ...
[gene1] [accession1] [cellid3] [labelval3] 0.0
[gene2] [accession2] [cellid3] [labelval3] 0.0
[gene3] [accession3] [cellid3] [labelval3] 0.0
[...] [...] [cellid3] [labelval3] ...
[...] [...] [...] [...] ...

Basically, the columns of the matrix are concatenated into one long vector (the expression column), and the column names (cellids) are added as an extra column (CellID).
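The reshaping described above can be sketched as follows, in Python for illustration. The function name `wide_to_long` and the argument layout are assumptions; `matrix[i][j]` is taken to hold the expression of gene i in cell j.

```python
def wide_to_long(genes, accessions, cell_ids, labels, matrix):
    """Yield one (gene, accession, cell_id, label, expression) row per value.

    Rows are ordered by cell, i.e. the matrix columns are concatenated.
    """
    for j, (cell_id, label) in enumerate(zip(cell_ids, labels)):
        for gene, accession, row in zip(genes, accessions, matrix):
            yield (gene, accession, cell_id, label, row[j])


rows = list(wide_to_long(
    ["gene1", "gene2", "gene3"],
    ["accession1", "accession2", "accession3"],
    ["cellid1", "cellid2", "cellid3"],
    ["labelval1", "labelval2", "labelval3"],
    [[0.0, 0.3, 0.0], [0.1, 0.4, 0.0], [0.0, 0.0, 0.0]],
))
```

Swapping the two loops gives the gene-sorted ordering from the earlier example; the set of rows is the same either way.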

JobLeonard commented 6 years ago

Oh, I see now! Yes, this would lead to very large files. But I see what you mean with convenience.

Also, all data that would be exported has already been downloaded and stored in the browser. You would get the uncompressed file, and then it is up to the user to compress it for storage.

The main risk is the website running out of memory for really large data selections, really.

I can also export a JSON file, which would be more compact (and similarly easy to write), but I do not know how big the demand for that would be.

mschilli87 commented 6 years ago

The extra space needed would come from repeating genes, accessions, cellids and labelvals ngene-1 times, so even when selecting 1,000 genes that would be 1000 * a couple of bytes. Unless I'm missing something, the difference in (raw) size between your format and my suggestion should be on the order of some megabytes at most. That shouldn't be an issue for any browser able to cache a few minutes of video, should it?

If you have an example TSV (your style), I could re-parse it and we could compare some real-world numbers.

JSON would be fine for me, but then again I wouldn't mind working with the loom file directly. TSV would have the huge advantage of being useful for the Excel folks, too (at least if you name your TSV file *.csv :wink:).

Anyways, don't feel pushed by me: Your suggestion is definitely parsable in a convenient-enough fashion, so stick with it if you prefer it simple and small. :smiley:

JobLeonard commented 6 years ago

No need to convince me yours is the better option! I was just arguing that zipping the exported file makes little sense when it is generated client-side. There is no bandwidth saved, and the first thing most people would do is unpack it anyway ;)

But for the sake of completeness: your maths is a bit off, since you forget to multiply by the number of cells!

In Sten's group we have quite a few loom files of tens of thousands of cells, and one test-file of 200k (which I use to stress-test the performance of the viewer). We do not know what the upper limit is here yet.

Also, it's a text file of strings, so the "couple of extra bytes" = gene name length + accession string length + CellID string length + label name string length (think stuff like "Oligodendrocyte" or a sentence-long comment) + expression converted to numerical string. Assuming one byte per character, that is closer to, say, 50 chars = 50 bytes on average, including the expression value.

TBH, selecting 1000 genes is a bit silly: first, I am not sure if the browser would be able to handle displaying that many sparklines for the bigger datasets. More importantly, at that point you might as well just download the raw loom file I think. Maybe I should add a warning suggesting it, with a direct download link to the loom file.

Assuming the data size will keep growing: 200k cells * 100 selected marker genes * 50 bytes = roughly 1 GB, and 1000 genes would be in the order of 10 GB (if my 50 bytes thing is way off, I doubt it will be more than).

So for the very largest datasets this is not negligible, and a warning with a link to the raw loom file makes sense there.
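The back-of-the-envelope estimate can be multiplied out explicitly. This is only a sketch: the 50 bytes per row is the guess from above, and the function name is made up for illustration.

```python
def estimate_long_export_bytes(n_cells, n_genes, bytes_per_row=50):
    """Rough size of a long-format export: one row per (gene, cell) pair."""
    return n_cells * n_genes * bytes_per_row


size = estimate_long_export_bytes(n_cells=200_000, n_genes=100)
print(f"{size:,} bytes, about {size / 2**20:.0f} MiB")
# prints: 1,000,000,000 bytes, about 954 MiB
```

The row count is the product of cells and selected genes, so the size scales linearly with each of them.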

Anyway, onto how to implement this. Conveniently, there is a ready-made React component for this:

https://github.com/petermoresi/react-download-link/blob/master/download-link.es6

All we need is a little code to convert the data to a CSV string. I'm not going to write out the whole structure for that here, but we do need to make sure that likely edge cases in labels are accounted for:

(it seems likely to me that someone will add commas and quotes into a label describing clusters, for example).
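As an illustration of the quoting rule such a converter needs, here is a small Python sketch using the standard `csv` module. The viewer would do the equivalent in JavaScript; the helper name `to_csv_line` and the example label are made up. The usual CSV convention is to wrap a field in double quotes when it contains a comma, a quote or a line break, and to double any embedded quotes.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one row, quoting fields that contain commas or quotes."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(fields)
    return out.getvalue()


line = to_csv_line(['Cluster 1, "mature" oligos', 0.3])
print(line, end="")
# prints: "Cluster 1, ""mature"" oligos",0.3

# Reading the line back recovers the original label unchanged.
parsed = next(csv.reader([line]))
```

A round trip through a CSV reader, as in the last line, is a cheap way to check that labels survive the export.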

JobLeonard commented 6 years ago

Do you have an opinion on what the JSON file should be structured as? I'm thinking of this:

{
    "row_attrs": {
        "Gene": [ "[gene1]", "[gene2]", "..." ],
        "Accession": [ "[accession1]", "[accession2]", "..." ]
    },

    "col_attrs": {
        "CellID": [ "[cellId1]", "[cellId2]", "..." ],
        "[Label Attribute]": [ "[label1]", "[label2]", "..." ]
    },
    "rows": {
        "[gene1]": [ 0, 0, "..." ],
        "[gene2]": [ 0, 0, "..." ],
        "..."
    }
}

(Where "[ ... ]" are placeholders, and the ... are in quotes to please Github's JSON syntax highlighter)

Using row_attrs and col_attrs here to follow their names in the loompy API.
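To show how a consumer might read this structure, here is a sketch in Python that flattens it into long-format rows. The concrete values, the `Class` label attribute and the function name `json_to_long` are made up for illustration; only the `row_attrs` / `col_attrs` / `rows` keys come from the proposal above.

```python
import json

# Concrete stand-in for the placeholder structure (values are made up).
EXAMPLE = """
{
  "row_attrs": {"Gene": ["gene1", "gene2"], "Accession": ["acc1", "acc2"]},
  "col_attrs": {"CellID": ["cell1", "cell2", "cell3"], "Class": ["a", "b", "c"]},
  "rows": {"gene1": [0.0, 0.3, 0.0], "gene2": [0.1, 0.4, 0.0]}
}
"""


def json_to_long(doc, label_attr):
    """Flatten the export into (gene, accession, cell, label, value) rows."""
    genes = doc["row_attrs"]["Gene"]
    accessions = doc["row_attrs"]["Accession"]
    cells = doc["col_attrs"]["CellID"]
    labels = doc["col_attrs"][label_attr]
    rows = []
    for gene, accession in zip(genes, accessions):
        values = doc["rows"][gene]  # expression values, one per cell
        for cell, label, value in zip(cells, labels, values):
            rows.append((gene, accession, cell, label, value))
    return rows


long_rows = json_to_long(json.loads(EXAMPLE), "Class")
```

Keying `rows` by gene name assumes gene names are unique within the selection; if they are not, keying by accession or using a plain list of arrays would avoid collisions.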

JobLeonard commented 6 years ago

Some proof for the comma-edge case: image