titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Example of processing in dask rather than pyspark #103

Closed. charliejeynes closed this issue 2 years ago

charliejeynes commented 2 years ago

Your documentation suggests you have tried using Dask rather than PySpark to process PubMed/MEDLINE. If you have any examples or tips, it would be great if you could share them. I'm going to attempt this myself, but it would be good to get some tips if you have already tried. I'm not keen on PySpark, but I like Dask from what I've seen. Thanks

titipata commented 2 years ago

@charliejeynes thanks for your suggestion. We have tried PySpark, but I have not implemented the pipeline on Dask yet. I hope the pipeline would be similar.

If someone has implemented it, please feel free to add it or open a pull request!

charliejeynes commented 2 years ago

OK, great. I'll experiment, and if it's worth it I'll make a pull request 🙂
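
No Dask version was posted in this thread. As a starting point for anyone picking this up, here is a minimal sketch of how the PySpark pattern (a flat map of `pp.parse_medline_xml` over a list of files) might translate to a Dask bag. The helper names `parse_one` and `build_bag`, the optional `parse_fn` argument, and the file path in the usage comment are all hypothetical and chosen for illustration. It has not been run against the full MEDLINE baseline, so treat it as a sketch under those assumptions rather than a tested pipeline.

```python
def parse_one(path, parse_fn=None):
    """Parse one MEDLINE XML file into a list of article dicts."""
    if parse_fn is None:
        # Imported lazily so the helper can be exercised with a stand-in parser.
        import pubmed_parser as pp
        parse_fn = pp.parse_medline_xml
    return parse_fn(path)


def build_bag(paths, parse_fn=None):
    """Return a lazy Dask bag of article dicts, one partition per input file."""
    import dask.bag as db

    paths = list(paths)
    if not paths:
        raise ValueError("no input files given")
    files = db.from_sequence(paths, npartitions=len(paths))
    # map() yields one list of dicts per file; flatten() merges them into a
    # single bag of dicts, playing the role of flatMap in a PySpark pipeline.
    return files.map(parse_one, parse_fn=parse_fn).flatten()


# Usage sketch (the path is an assumption, adjust to your download location):
#   from glob import glob
#   articles = build_bag(sorted(glob("data/medline/*.xml.gz")))
#   first = articles.take(1)          # parses only the first file
#   df = articles.to_dataframe()      # lazy Dask DataFrame of all articles
```

Using one partition per file keeps each task's memory bounded by a single parsed file, which seems like a reasonable default given how large individual MEDLINE archives can be. Whether that partitioning is the best choice will depend on the scheduler and the machine.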