vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks

Convert CSV to HDF5 with TOPCAT and VAEX #2041

Open godsmustbcrazy opened 2 years ago

godsmustbcrazy commented 2 years ago

Description

I am following the documentation here https://vaex.readthedocs.io/en/latest/getting_data_in_vaex.html#getting-your-data-in-and-out-of-vaex. My workflow is to convert multiple CSVs instead of multiple FITS files. I have about 70 CSVs of about 2 GB each, and I was hoping to convert them into one HDF5 file. Since this would take a long time, I wanted to test the workflow first with a small file.

Software information

Steps:

  1. Convert a csv to colfits using TOPCAT

     ./topcat.sh -stilts -Djava.io.tmpdir=/tmp tcat in=@filestoprocess.txt ifmt=csv out=allyears.fits ofmt=colfits

  2. Produces a fits file with no issues
  3. Convert fits to hdf5 - produces error below

     vaex convert .\allyears.fits .\allyears.hdf5
     WARNING: UnitsWarning: Unit 'iso' not supported by the VOUnit standard. [astropy.units.format.vounit]
     [04/30/22 13:42:35] ERROR    error opening '.\\allyears1.fits'    __init__.py:259
     Traceback (most recent call last):
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\__init__.py", line 232, in open
         ds = vaex.dataset.open(path, fs_options=fs_options, fs=fs, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 85, in open
         raise IOError(f'Cannot open {path}, failures: {failures}.')
     OSError: Cannot open .\allyears1.fits, failures:
     -----<class 'vaex.astro.fits.FitsBinTable'>-----
     :Traceback (most recent call last):
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 79, in open
         return opener.open(path, fs_options=fs_options, fs=fs, *args, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 1438, in open
         return cls(path, *args, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\astro\fits.py", line 91, in __init__
         self._try_votable(fitsfile[0])
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\astro\fits.py", line 130, in _try_votable
         self.descriptions[clean_name] = field.description
     AttributeError: 'FitsBinTable' object has no attribute 'descriptions'
     .

     Traceback (most recent call last):
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\runpy.py", line 197, in _run_module_as_main
         return _run_code(code, main_globals, None,
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\runpy.py", line 87, in _run_code
         exec(code, run_globals)
       File "C:\Users\prem\.pyenv\pyenv-win\versions\3.9.4\Scripts\vaex.exe\__main__.py", line 7, in <module>
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\__main__.py", line 73, in main
         vaex.convert.main([os.path.basename(args[0]) + " " + args[1]] + args[2:])
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\convert.py", line 130, in main
         df = vaex.open(args.input)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\__init__.py", line 232, in open
         ds = vaex.dataset.open(path, fs_options=fs_options, fs=fs, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 85, in open
         raise IOError(f'Cannot open {path}, failures: {failures}.')
     OSError: Cannot open .\allyears1.fits, failures:
     -----<class 'vaex.astro.fits.FitsBinTable'>-----
     :Traceback (most recent call last):
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 79, in open
         return opener.open(path, fs_options=fs_options, fs=fs, *args, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\dataset.py", line 1438, in open
         return cls(path, *args, **kwargs)
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\astro\fits.py", line 91, in __init__
         self._try_votable(fitsfile[0])
       File "c:\users\prem\.pyenv\pyenv-win\versions\3.9.4\lib\site-packages\vaex\astro\fits.py", line 130, in _try_votable
         self.descriptions[clean_name] = field.description
     AttributeError: 'FitsBinTable' object has no attribute 'descriptions'

JovanVeljanoski commented 2 years ago

Can you check if you can even open a fits file anymore?

I am not sure how much that part of vaex is maintained nowadays. Also, if I remember correctly, there may be several FITS file format conventions; vaex should support only the columnar one.

If the original data is in CSV, it is perhaps best to convert that to HDF5. If you think that is too slow, a good idea would be to go via Apache Arrow. If you need more info on this, let us know and we can help.
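
For anyone who wants to try the Arrow route, here is a minimal sketch; it is one possible reading of "go via Apache Arrow", not necessarily the approach the maintainers have in mind. It assumes both pyarrow and vaex are installed and uses `pyarrow.csv.read_csv`, `vaex.from_arrow_table` and `DataFrame.export_hdf5`. The helper names `default_hdf5_path` and `csv_to_hdf5_via_arrow` are made up for this illustration. Note that `read_csv` loads the whole CSV into memory, so this assumes a single file fits in RAM.

```python
from pathlib import Path


def default_hdf5_path(csv_path):
    """Hypothetical helper: place the HDF5 file next to the CSV it came from."""
    return Path(csv_path).with_suffix(".hdf5")


def csv_to_hdf5_via_arrow(csv_path, hdf5_path=None):
    """Read one CSV with Arrow's CSV reader and export it to HDF5 with vaex."""
    # Imported lazily so the path helper above works without these packages.
    import pyarrow.csv as pacsv
    import vaex

    target = Path(hdf5_path) if hdf5_path is not None else default_hdf5_path(csv_path)
    table = pacsv.read_csv(str(csv_path))  # assumption: the whole file fits in memory
    df = vaex.from_arrow_table(table)      # wrap the Arrow table as a vaex DataFrame
    df.export_hdf5(str(target))            # write the columns out as HDF5
    return target
```

A call such as `csv_to_hdf5_via_arrow("2019.csv")` (file name hypothetical) would write `2019.hdf5` next to the input.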

godsmustbcrazy commented 2 years ago

I could not open the FITS file either; it produced a similar error. I ended up converting the CSVs one at a time into HDF5 in a Jupyter notebook and then loading them all into vaex at once. I did not investigate this further once I had found a different workflow.
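
A rough sketch of that kind of workflow, for reference. It is not the exact notebook code: the directory layout, the chunk size and the helper names `hdf5_target` and `convert_csvs` are assumptions made for illustration. It relies on `vaex.from_csv` with `convert` and `chunk_size`, and on `vaex.open_many`.

```python
from pathlib import Path


def hdf5_target(csv_path):
    """Hypothetical helper: the HDF5 file that sits next to a given CSV."""
    return Path(csv_path).with_suffix(".hdf5")


def convert_csvs(csv_dir, chunk_size=5_000_000):
    """Convert each CSV in csv_dir to its own HDF5 file, then open them together."""
    import vaex  # imported lazily so the path helper works without vaex installed

    targets = []
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        target = hdf5_target(csv_path)
        if not target.exists():
            # Read the CSV in chunks and write the converted result to `target`.
            vaex.from_csv(str(csv_path), convert=str(target), chunk_size=chunk_size)
        targets.append(str(target))
    if not targets:
        raise FileNotFoundError(f"no CSV files found in {csv_dir}")
    # Present all converted files as one DataFrame.
    return vaex.open_many(targets)
```

If a single file is still wanted, calling `export_hdf5("allyears.hdf5")` on the returned DataFrame should write the combined data to one HDF5 file.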

JovanVeljanoski commented 2 years ago

OK, great that you figured it out.

Let's keep this open until @maartenbreddels can take a look.