mholt / PapaParse

Fast and powerful CSV (delimited text) parser that gracefully handles large files and malformed input
http://PapaParse.com
MIT License
12.44k stars 1.14k forks source link

Add support for transposing returned data. #68

Closed bhagyas closed 10 years ago

bhagyas commented 10 years ago

For an example, if the CSV contains a list of categories and sub categories, the sub categories must be able to be grouped under each category (which might also be defined as a header).

Papa returns this at the moment.

[
  [{"catA" : "subcatAA", "catB" : "subcatBA"}],
  [{"catA" : "subcatAB", "catB" : "subcatBB"}],
  [{"catA" : "subcatAC", "catB" : "subcatBC"}]
]

instead of the required.

[
  "catA" : ["subcatAA","subcatAB","subcatAC"],
  "catB" : ["subcatBA","subcatBB","subcatBC"]
]

Currently this can be achieved using underscore.js and following code ;

_.transpose = function(array) {
    var keys = _.union.apply(_, _.map(array, _.keys)),
        result = {};
    for (var i=0, l=keys.length; i<l; i++) {
        var key = keys[i];
        result[key] = _.pluck(array, key);
    }
    return result;
};

//results being the argument in complete callback function
var categories = _.transpose(results.data); 

Suggesting an optional boolean configuration parameter named transpose and results would be a transposed array mapped using the function above.

References http://stackoverflow.com/questions/13353674/how-to-transpose-object-in-underscorejs

mholt commented 10 years ago

Hm, I'm not sure I understand entirely how this is the role of the parser. What is the input CSV text? Maybe that will help me understand the question better.

bhagyas commented 10 years ago

The input CSV would look like this.

catA,catB
subcatAA, subcatBA
subcatAB, subcatBB
subcatAC, subcatBC

Spreadsheet programs have the ability to transpose (rotate) the data in a table.

bhagyas commented 10 years ago

Parser works fine, this is a suggestion to add an additional option to transpose returned results object. :+1:

mholt commented 10 years ago

I see. So you want a way for results to be grouped by column instead of row. Thanks for the suggestion!

First, the output that Papa gives for the input above, with header: true, is not like this:

[
  {"catA" : "subcatAA", "catA" : "subcatAB", "catA" : "subcatAC"},
  {"catB" : "subcatBA", "catB" : "subcatBB", "catB" : "subcatBC"}
]

because here the field (or property) name is repeated three times, and objects can't have duplicate property names.

Assuming that was just a mistake in your post (?), can you just iterate the results as they are, like:

for (var row = 0; row < results.data.length; row++)
{
    console.log("Subcat A:", results.data[row]["catA"]);
    console.log("Subcat B:", results.data[row]["catB"]);
}

(or something like that) to get the subcategories together? What are the benefits of transposing? (Admittedly I haven't come across a need for it, you're the first to ask, and I'm trying to justify the extra complexity.)

Also, what would we do in the case of streaming? I guess just get one value for each category at a time?

Thanks for helping me figure this out!

bhagyas commented 10 years ago

Yes, sorry I had made a mistake in providing the JSON. Thank you for the correction. I updated the previous comments with the correct JSON outputs.

You can see this functionality being used on http://www.wooroll.com home page. It uses PapaParse and underscore.js to create the list of job categories and jobs from a CSV file that is requested through an AJAX call.

The advantage of having transpose built-in would be that PapaParse users would gain another option to manipulate the returned data without writing custom code. This functionality would also help developers to create category/subcategory, menu container/menu item by using CSV files and PapaParse.

I'm not sure about the case when streaming. A proper transpose operation would need to consider all the rows and columns as it does a complete rotation of the contents.

Cheers.

mholt commented 10 years ago

@bhagyas That's awesome; thanks for the details. That's definitely compelling. It may not be too hard either since, even though the algorithm is O(r*c), we're already doing a double-for loop in some post-processing if the user enables dynamicTyping or header. Perhaps it could be built into there.

I do still need to nail down the logic for when streaming, in other words, the file is read in chunks and the results are delivered row-by-row... it can't be column-by-column because the file is read horizontally. I'll think about it but if anyone has ideas, I'd love to hear them. If a solution can't be reached, it would delay the implementation since I would then have to consider having config settings that are not compatible with others, which right now is not really a problem.

mholt commented 10 years ago

@bhagyas What to do in the case of empty fields? Imagine:

A,B,C
a,b,c
d,,f
g,h,i

What would be the expected results of transposing? This:

{
    "A": ["a", "d", "g"],
    "B": ["b", "h"],
    "C": ["c", "f", "i"]
}

or this:

{
    "A": ["a", "d", "g"],
    "B": ["b", "", "h"],
    "C": ["c", "f", "i"]
}

(Notice the difference is an empty string in the B column.) In other words, do we fill in missing values with empty strings or leave them out?

Also, what about too many fields per row:

A,B,C
a,b,c,d

What do we do with d? Surely this would generate an error, as it can't be associated to a column. What would be your expectation?

bhagyas commented 10 years ago

@mholt I think it would be best to give developers the choice of setting them as options.

Explanation

mholt commented 10 years ago

I figured out a way to get this to work with streaming... we'd just read in the first row, store those as the "column" names (which we already do), then we just have to transpose each row as we read it in.

Current result format when streaming with a header row:

// one row is obvious
[
    {
        "A": "a",
        "B": "b",
        "C": "c"
    }
]

Proposed format when streaming and transposing:

// one row not so obvious
{
    "A": ["a"],
    "B": ["b"],
    "C": ["c"]
}

The two are quite similar but have some profound differences. I did some experimenting and, beside the question above about how to handle missing or extra fields, I also wondered how the header config option would play in with transpose? Transpose would have to override header (kind of like stream overrides chunk) but not in the sense of using one or the other; rather, transpose would have to force header to be enabled. I suppose this could be okay and documented, but it is not very intuitive.

Although there may be a better way to do it, my goal remains to keep Papa as simple and easy to use as possible. I feel like there are so many variations about how a transpose can be implemented, as we've seen a little bit of already -- and the operation itself is actually not too difficult for a programmer to do just the way they need, without having to coerce a library to do it for him/her.

I really appreciate your feedback on this. I'm going to shelve this suggestion for now, but we'll see if something else comes up on the matter.

mysticaltech commented 4 years ago

@mholt Thanks for considering this, it would have helped me a lot actually. I would propose enabling the transpose option when header is false.

mysticaltech commented 4 years ago

But no worries, just found out that using the zip method from any functional library (licia, underscore etc.) it is easily doable.

 const rows = data.map((row) => row.data)
 const columns = zip(...rows)