mgalas / ETL4Data

Building ETLs is repetitive and data-source specific, so we want to provide an off-the-shelf set of ETLs for as many of the data sources listed in Landscape4Data as possible. This will give users the ability to automate data harvesting and updates to existing data.
GNU General Public License v3.0

data pipelines #1

Open mgalas opened 7 years ago

mgalas commented 7 years ago

Design and implement data pipelines, using Sqoop, for each of the four datasets identified by the Landscape4Data team.

ulince commented 7 years ago

From my understanding, Sqoop is used to bulk-import data from relational databases into Hadoop. In the case of our custom harvester, is the aim to import the datasets from CKAN (PostgreSQL) into HDFS?
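For concreteness, a minimal sketch of what such a Sqoop import might look like; the host, credentials, table name, and HDFS target directory below are all hypothetical:

```sh
# Minimal sketch: bulk-import one CKAN table from PostgreSQL into HDFS.
# Host, credentials, table name, and target directory are hypothetical.
sqoop import \
  --connect jdbc:postgresql://ckan-db.example.org:5432/ckan \
  --username ckan_reader -P \
  --table resource \
  --target-dir /data/ckan/resource \
  --num-mappers 1
```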

mgalas commented 7 years ago

As far as I remember, Sqoop goes beyond relational databases only. There should be a way to import from other sources, e.g. URLs (so that we can point it at the OSM URL). Please investigate.

ulince commented 6 years ago

From my research, I have come up with the following approaches:

  1. Write a simple shell script that streams the planet file straight into HDFS: `curl https://planet.openstreetmap.org/planet_latest.osm.bz2 | pbzip2 -cd | hadoop fs -put - /path/filename` (the `-` tells `hadoop fs -put` to read from stdin). Then, create an Oozie shell action to automate running it; see the sketch after this list.

  2. Similar to the previous approach, but mount HDFS as a local filesystem and have the script write the downloaded file directly to the mounted path.

  3. Write a custom Sqoop connector that downloads the file and imports it into HDFS. This seems like the most complex option.

Apart from these, I haven't found or come up with any more elegant or standard solutions. Would any of these be acceptable? If not, could you give me further suggestions?
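As a concrete starting point, here is a minimal sketch of approach 1, assuming `curl`, `pbzip2`, and the Hadoop client are available on the node that runs the Oozie shell action (the HDFS target path is hypothetical):

```sh
#!/bin/bash
# Sketch of approach 1: stream the OSM planet file into HDFS without
# staging it on local disk first. The HDFS target path is hypothetical.
set -euo pipefail
curl -fsSL https://planet.openstreetmap.org/planet_latest.osm.bz2 \
  | pbzip2 -cd \
  | hadoop fs -put -f - /data/osm/planet-latest.osm
```

The same script body would go into the Oozie shell action, with a coordinator providing the schedule.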

mgalas commented 6 years ago

https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-2-birender-saini

The second part in particular looks interesting.

mgalas commented 6 years ago

https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html

ulince commented 6 years ago

Here are the links I mentioned in our meeting yesterday:
https://blog.cloudera.com/blog/2013/03/how-to-use-oozie-shell-and-java-actions/
https://github.com/jupyterhub/jupyterhub
https://github.com/PanierAvide/OSM2Hive

ulince commented 6 years ago

https://blogs.msdn.microsoft.com/pliu/2016/06/19/run-jupyter-notebook-on-cloudera/

ulince commented 6 years ago

Data auditing project: https://github.com/keepright/keepright/tree/master/checks