mgalas opened this issue 7 years ago
From my understanding, Sqoop is used to bulk import data from relational databases into Hadoop. In the case of our custom harvester, is the aim to import the data sets from CKAN (PostgreSQL) into HDFS?
As far as I remember, Sqoop goes beyond relational databases only. There should be a possibility of importing from other sources, e.g. URLs (so that we can point it at the OSM URL). Please investigate.
From my research, I have come up with the following approaches:
1. Write a simple shell script that does the following (see also the sketch after this list):
   curl https://planet.openstreetmap.org/planet_latest.osm.bz2 | pbzip2 -cd | hadoop fs -put - /path/filename
   Then create an Oozie shell action to automate running it.
2. Similar to the previous approach, but mount HDFS as a local filesystem and use a similar shell script that downloads the file into HDFS directly.
3. Write a custom Sqoop connector that downloads the file and imports it into HDFS. This seems like the most complex option.
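A minimal sketch of the first approach, assuming pbzip2 and the Hadoop client are available on the node running the action; the target HDFS path and URL handling are placeholders, not agreed values:

```sh
#!/usr/bin/env bash
# Sketch of approach 1: stream the OSM planet dump into HDFS without
# materialising it on local disk first. URL and HDFS path are placeholders.
set -euo pipefail

PLANET_URL="https://planet.openstreetmap.org/planet_latest.osm.bz2"
HDFS_TARGET="/data/osm/planet-latest.osm"   # assumed target path in HDFS

# curl streams the compressed dump, pbzip2 decompresses it in parallel,
# and `hadoop fs -put -` reads from stdin and writes the HDFS file.
curl -sSfL "$PLANET_URL" \
  | pbzip2 -cd \
  | hadoop fs -put -f - "$HDFS_TARGET"
```

This is also what the Oozie shell action would invoke on a schedule. For the second approach, the only change would be replacing the final `hadoop fs -put` step with a redirect into the mount point (e.g. `> /mnt/hdfs/data/osm/planet-latest.osm`), assuming HDFS is mounted locally with something like hadoop-fuse-dfs.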
Apart from these, I haven't found or come up with any more elegant or standard solutions. Would any of these approaches be acceptable? If not, could you give me further suggestions?
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-2-birender-saini
The second part in particular looks interesting.
Here are the links I mentioned at our meeting yesterday:
https://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/
https://github.com/jupyterhub/jupyterhub
https://github.com/PanierAvide/OSM2Hive
Data auditing project: https://github.com/keepright/keepright/tree/master/checks
Design and implement data pipelines, with use of Sqoop, for each of the 4 datasets identified by the Landscape4Data people.
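For the CKAN/PostgreSQL side, a plain `sqoop import` per dataset should be enough. A rough sketch, where the host, database, credentials, table and target directory are all assumptions rather than agreed values:

```sh
# Sketch of one Sqoop pipeline step: import a CKAN table from PostgreSQL into HDFS.
# Connection details, table name and target directory are placeholders.
sqoop import \
  --connect jdbc:postgresql://ckan-db.example.org:5432/ckan \
  --username ckan_reader \
  --password-file /user/etl/ckan.password \
  --table package \
  --target-dir /data/ckan/package \
  --num-mappers 4
```

One such job per dataset, chained in an Oozie workflow, would cover the four pipelines; the OSM planet dump itself could still come in through one of the approaches discussed above.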