Concatenate data sources (e.g., parsed CSV)

tiguchi commented 7 years ago

This is just an optimization request, not a bug report.

Following scenario: I'm loading a CSV file that contains product definitions. It has about 3000 records and 13 fields. Each line has up to 2000 characters of string data. Strukt is reasonably fast doing just filtering. Sorting by one of the fields takes a long time though. I stopped the operation after about 1 minute. It would be great if this could be sped up.

kengruven commented 7 years ago

This is a totally legitimate issue. I intend for performance to be one of Strukt's core features.

I'm not sure what's going on in your case. Sorting 3000 rows should be really fast. In fact, one of the test cases I've been using this week is: GenerateRange(n=100,000)→Shuffle→Sort(field=i). It takes less than 4 seconds to sort 100,000 rows on my Mac. Strings will take a little longer to sort than numbers, but not that much longer.

I made a CSV file with dummy data (3000 rows x 10 columns), and put it in Load File → Parse CSV → Sort. It takes about 0.25 seconds here.

I'll keep investigating, and see if I can come up with any reason why it would be so slow. If you have a data file you're willing to share (publicly or privately) that demonstrates the issue, I'd be more than happy to take a look.

tiguchi commented 7 years ago

I just tried to reproduce the issue and at first Strukt was speedy as you expected. Sorting took under 2 seconds. Then I recreated the exact operation setup from the GIF I posted earlier. When I then sort the modified SKU column by clicking the column in the table view it can get randomly stuck in some kind of an infinite calculation loop where Strukt doesn't snap out anymore.

The first sort click action works fine, and after 2 seconds everything is sorted. I click again and the timer keeps on running and running without end. It's really finicky to reproduce. At the moment it appears stable to me when I toggle the sort order, so it's random and I had really bad luck.

Here's a screen shot of the script. I also made a mistake. I was thinking that the two load file / parse CSV operations in the beginning accumulate the results and produce one big result set. I added that mistake to my test in order to make it easier to reproduce the bug:

Unfortunately I cannot publicly post the source files of the CSV files. I don't want to draw attention to them. I can send you the two URLs by email or another channel.

kengruven commented 7 years ago

Aaah, I know what's going on. There's a (known) bug that LoadFile→LoadFile never completes, for some reason. I haven't investigated it yet because I thought it was obscure enough that nobody would run into it -- oops.

Even if LoadFile didn't have this bug, that approach wouldn't work right now, anyway. ParseCSV simply drops any input records that don't have a "line" (but maybe it shouldn't).

I never thought of appending data sets this way, but it's an interesting idea. LiteralValue can add its records where=before or where=after, so it would be consistent for LoadFile to support the same flag.

I'll take a swing at this and get back to you.

kengruven commented 7 years ago

Now that I got 1.1 out the door (tougher than it looks!), I'm free to fix a lot of these usability issues.

In the next version (1.1.1, later this week or early next week), LoadFile will pass all of its input through to its output before loading the file. So you will be able to do LoadFile(a.csv)->LoadFile(b.csv)->... to concatenate two files together. It's not ideal, but I think it's an acceptable compromise until there's a better way to structure it.

I may do something similar for ParseCSV, so that the above pipeline works as-is, but I want to check out the other operations (ParseTSV, etc.) to make sure I can do it in a consistent way.

kengruven commented 7 years ago

You can now (as of 1.1.1) concatenate files (text or binary) using LoadFile. You can't concatenate CSV data yet, exactly, because ParseCSV will drop non-text cards. So if you're willing to put up with an extra header in the middle of your data (or filter it out some other way), you can work around this.

I'm going to rename this issue to reflect the actual goal, but leave it open. Dealing with this well, in the general case, is something that's always been on the roadmap. It just hasn't been a top priority so far.

Chris-Knight commented 6 years ago

I would also like the ability to concatenate data sources. In my case it is multiple plist files that have the same structure but mostly different data. Being able to load in multiple plist files and then output them as one plist or json or csv file would be very, very nice. Being able to combine data from different sources would even be better. For an example, Safari used to save browser history in plist files. Now it saves the same history in sqlite files. Being able to combine this data and then outputting them into some format would be very, very slick.

kengruven / strukt-bugs

Concatenate data sources (e.g., parsed CSV) #2