vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks source link

Concatenating large files #2158

Open sk2 opened 2 years ago

sk2 commented 2 years ago

Hi, I noticed in the tutorials it mentioned performance may be better if multiple smaller files are combined into one larger file. I could open many files and then save them, but this involves a lot of disk duplication, which doesn't scale.

Is there a technique that could combine multiple smaller files by manipulating the individual hdf5 files? I have looked at some Linux hdf tools but I'm wary of clobbering the data format (especially when it comes to rows).

Ideally there's a way to join them by moving chunks of disk - I do not need to retain the original hdf5 files so the raw bytes could be moved.

I couldn't see anything in the tutorials/faq on this.

Thanks Simon

JovanVeljanoski commented 2 years ago

I am not aware of any such technique or tool. I understand your concerns about data duplication/redundancy, but keep in mind that this sort of conversion that is implemented now (that leads to data duplication) is much safer, i.e. is something goes wrong the original data is unaffected.

Besides, the concatenation needs to figure out whether the schema is consistent across all of the files and what to do if it is not.

Having said that, if anyone has a good idea on how to improve this, PRs are very welcome.

sk2 commented 2 years ago

Thanks for the quick response. That makes sense for the reasons you outline. I'm looking to append incoming data to a file, where I know the structure is the same, so copying the whole thing wouldn't work as well. In this case I'd probably be best going down the route outlined at https://forum.hdfgroup.org/t/merge-multiple-h5-files-containing-groups-into-one-h5/6911 Is the internal structure used when exporting a hdf5 from Vaex fairly easy to understand?

BTW - entirely off topic - but I saw you're in Amsterdam. I'm over in Amsterdam for Sigcomm in a couple of weeks, would be nice to say hi (I'm usually in Australia).

JovanVeljanoski commented 2 years ago

Ah i see your point. Indeed appending in coming data to an already existing blob is something we've been thinking about. I has come up in a few discussions in the past as well. I am afraid at this time I do not have a full solution.

Is the internal structure used when exporting a hdf5 from Vaex fairly easy to understand?

I've never really worked with hdf5 files outside of vaex (or very little at least when doing some testing). The structure should be quite simple: it is column oriented container. For more info of how this looks like, you can peak here: https://github.com/vaexio/vaex/blob/master/packages/vaex-hdf5/vaex/hdf5/writer.py#L20

and it is used here: https://github.com/vaexio/vaex/blob/master/packages/vaex-core/vaex/dataframe.py#L6904


Oh cool, very nice. Coming all the way. Yeah definetely send a message when you are around. Feel free to join our slack also (link on the front page of the repo).

sk2 commented 2 years ago

Hi, thanks, I'm trying the basic way, and both:

all_files = sorted(glob.glob("combined/*hdf5"))
df1 = vaex.open_many(all_files)

and

df1 = vaex.open_many("combined/*hdf5")

are giving me strange behaviour. These are time series data, and I am trying to join them together to export to a single hdf5 file for efficiency. They were all created from the same base sql query which created csvs which I saved using Vaex.

If I compare to each file:

 all_files = sorted(glob.glob("combined/*hdf5"))
total_length = 0
>>> for file in all_files:
...     df2 = vaex.open(file)
...     print("max", df2.end_time.max())
...     total_length += df2.shape[0]

I get a total length of

498009636

Where as the open many is that:

>>> df1.shape[0]
249004818

Is there a potential limit in the total amount that can be concatenated? Is there a better way to do the "basic" concatenation?

The unusual thing is that if I export the combined hdf5 file, it is about the size of each of the smaller .hdf5 combined - but doesn't seem to have all the data.

Thanks for the feedback on the adding to an existing blob, I'll take a look at the link. I'll also join the slack - I didn't notice the link on the website!

JovanVeljanoski commented 2 years ago

So looking at the code, vaex.open_many expects a list of file paths. It is vaex.open() that accepts a globe expression within as a first argument.

I would expect this df1 = vaex.open_many("combined/*hdf5") to crash or not work reliably.

Also looking at the code you are relying on df2.end_time.max(), maybe doing something like len(df) would be better. I don't know the if the sorting /ordering of the files matter, glob can be funny sometimes.

I would rely on df.shape (the sum of the individuals vs the the concatenated dataframe via any method) to draw my conclusions.

If there are any problems via exporting (like unsupported types etc..) there should be a clear warning / log in your console/notebook

sk2 commented 2 years ago

Hi, yeah I also tried with open(“*.hdf5”) and the max function didn’t return the max I expected.

I expect the df = vaex.open(“*.hdf5”) to open all hdf5 in that directory, and the shape of the df indicates this (it is the sub of each single .hdf5 file).

However the max value within df for a column is not the same max value if I check each .hdf5 file individually. it looks like its only checking in the first three files from the glob and then returns the max from that.

Should I open this as a new issue of type bug?

Thanks

JovanVeljanoski commented 2 years ago

If you can make a reproducible example that would be great!

sk2 commented 2 years ago

I’ll see if I can put together a reproducible example - given the large file sizes it luging ve tricky with synthetic data.

Ben-Epstein commented 2 years ago

@sk2 i ran into a similar issue and indeed the regular concat did not scale. I am using this internally and it has not caused any issues for me. I opened a PR but I'm not sure if Jovan or Maarten wanted it in the official code (it was a draft PR just to share it with people that may need something similar)

https://github.com/vaexio/vaex/pull/1910