Project Idea for GSoC 2017

jainamritanshu commented 7 years ago

Hi everybody, I am Amritanshu Jain, a sophomore at BITS Pilani, Pilani Campus. I wish to apply for GSoC under your organisation. I have over six years of programming experience in Python and JS, and have gained a substantial command of Python libraries/frameworks such as Django, Scrapy, NumPy, OpenCV, and Pandas. I have also worked with Node.js and React/Redux on a few projects.

While going through your project ideas, I was fascinated by the "Add Spatial Data Support to Data Retriever" idea, since I recently made a small version of Data Retriever for mutual funds. The script scrapes the past 10 years of data for almost 170 mutual fund schemes, processes it into a CSV file, and then saves it into a local PostgreSQL database.

For the past few months I have been more involved with the development of web applications and text mining. As discussed with one of the mentors, I would like to extend Data Retriever to disciplines where it lacks data, such as market data. Kindly go through my CV if time permits; I have briefly described my experience and projects there.

Lately I have been working on plenty of projects, so I am already in a good programming flow. I would love to build a basic prototype to show that I am capable of undertaking this GSoC project. Please guide me on how to proceed.

Thanks,
Amritanshu Jain
GitHub - https://github.com/jainamritanshu/
Email - jainamritanshu@gmail.com

@ethanwhite @henrykironde #2017 #DataRetriever

henrykironde commented 7 years ago

Hi @jainamritanshu, thank you for your interest in GSoC Data Retriever 2017, specifically "Add Spatial Data Support to Data Retriever". As part of the application, we ask students to go through the Data Retriever issues and make some contributions. Feel free to ask for clarification on the issues.

Additionally, we recommend that students carefully read the [student contribution page for GSoC](https://github.com/numfocus/gsoc/blob/master/CONTRIBUTING-students.md).

ethanwhite commented 7 years ago

Hi @jainamritanshu - thanks for the contributions you've been working on and for your interest in adding spatial data support. Spatial data is becoming a really common format, and it would be great to support importing it into database management systems. As I mentioned over email, we're also definitely interested in work on actively adding new kinds of data.

You've already seen the original issue on this (https://github.com/weecology/retriever/issues/325). In short, we'd like to support importing both raster and vector spatial data into PostgreSQL + PostGIS and SQLite + SpatiaLite. At the most basic level this could involve only formats that can be directly imported, importing them using the projections with which they are defined. We're happy to answer more questions as you start to think about the details of how you might go about tackling this.
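As a rough sketch of what that most basic level could look like for vector data, assuming GDAL's `ogr2ogr` command-line tool does the import. The function name, the backend labels, and the paths are hypothetical, and the commands are only built here, not executed:

```python
def build_vector_import_command(source_path, table_name, backend, target):
    """Build an ogr2ogr command that loads a vector file into a spatial database.

    backend: "postgis" or "spatialite" (labels chosen for this sketch).
    target:  a database name for PostGIS, or an output file path for SpatiaLite.
    """
    if backend == "postgis":
        # ogr2ogr takes the destination first, then the source.
        return ["ogr2ogr", "-f", "PostgreSQL", "PG:dbname=" + target,
                source_path, "-nln", table_name]
    if backend == "spatialite":
        # SPATIALITE=YES asks the SQLite driver to create a SpatiaLite database.
        return ["ogr2ogr", "-f", "SQLite", "-dsco", "SPATIALITE=YES",
                target, source_path, "-nln", table_name]
    raise ValueError("unsupported backend: " + backend)


if __name__ == "__main__":
    print(build_vector_import_command("roads.shp", "roads", "postgis", "gisdb"))
```

Raster data would need a different route (for PostGIS, for instance, the `raster2pgsql` loader), which this sketch does not cover.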

jainamritanshu commented 7 years ago

@ethanwhite @henrykironde I was going through a couple of articles on spatial databases. I found that PostgreSQL has great support for spatial features, while MySQL provides only a small number of Minimum Bounding Rectangle functions. MySQL has a drawback here: when planning an application, you need to check whether MySQL provides the necessary spatial functions, whereas PostgreSQL needs no such check. Some workarounds are available, depending on the storage engine used (MyISAM or InnoDB). So along with PostgreSQL and SQLite, why don't we try to implement spatial data support in the MySQL engine as well?

ethanwhite commented 7 years ago

I have no objections to trying to add it as well. I'd recommend including it as a stretch goal in the proposal, with the plan to work on it if the core work on Postgres & SQLite is complete and extra time is available.

jainamritanshu commented 7 years ago

@ethanwhite @henrykironde As I was going through the vector and raster data types, I realized that each suits a different kind of dataset. The vector data type is more suitable for the human world (census data and the like), and the raster data type for the natural world (for example, information about landslides in particular regions). So should we leave the choice of data type entirely to the user, or should we suggest a data type by looking at the dataset? Can we write a script that automatically detects the most suitable data model for a given dataset? Some of these points might help us determine the data type:

Continuous data? - Continuous data is poorly stored in the vector data type.
Pixels or coordinates? - Raster data works with pixels. Vector data consists of coordinates.
Scaling the features? - Vector objects can be scaled up to the size of a billboard. The raster data type does not offer that flexibility.
Restrictions on file size? - A raster file can be larger than a vector dataset covering the same phenomenon and area.

These are some of the points I could gather with my current understanding of the difference between the two data types. Also, I don't think we can expect users to know enough to decide which data type they should use for a particular dataset. So if we are unable to write a script that determines the suitable data type for a dataset, I suggest we at least present some of these points clearly to the user so that they can make a wise choice.
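Detecting the best model from the content itself is hard. A much simpler starting point, assuming the dataset already arrives in a spatial file format, is to read the model off the file extension. A minimal sketch (the function name and the extension lists are my own illustrative assumptions, not an exhaustive registry):

```python
import os

# Illustrative extension lists only; real coverage would need to be much wider.
VECTOR_EXTENSIONS = {".shp", ".geojson", ".kml"}
RASTER_EXTENSIONS = {".tif", ".tiff", ".asc", ".img"}


def guess_data_model(path):
    """Guess "vector" or "raster" from a file name, or "unknown" if unsure."""
    extension = os.path.splitext(path)[1].lower()
    if extension in VECTOR_EXTENSIONS:
        return "vector"
    if extension in RASTER_EXTENSIONS:
        return "raster"
    # Fall back to asking the user instead of guessing.
    return "unknown"


if __name__ == "__main__":
    for name in ("census_tracts.shp", "elevation.TIF", "notes.txt"):
        print(name, guess_data_model(name))
```

When the result is "unknown", the tool could fall back to showing the user the points listed above.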

I was also wondering whether we should make two separate engines for the vector and raster data types, since they take different approaches to building the database, or whether we should handle both in a single engine?

henrykironde commented 7 years ago

The composition, abstraction, or more broadly the object-oriented design concepts will become clear as we work on this project.

Given that we have the categorization of each dataset (either a vector or a raster dataset) and the properties of each, the users will select only the information they want. For example, we know that our dataset A is a vector with a finite set of properties. If the user wants data described by a subset of these properties, the tool should be able to fetch that data. The users may or may not know all the properties of these datasets, but they will have a good understanding of the set of properties they want from a dataset.

> I was also wondering that we should make two separate engines for vector and raster data type separately since they have different approach to make the database, or should we include it in a single engine/?

There will be separate objects for vectors and rasters.
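A minimal sketch of how such separate objects might be laid out, with the property selection described above living in a shared base class. All class, attribute, and method names here are hypothetical and not taken from the retriever code:

```python
class SpatialDataset:
    """Properties shared by every spatial dataset (hypothetical base class)."""

    data_model = None

    def __init__(self, name, properties):
        self.name = name
        self.properties = list(properties)

    def select(self, wanted):
        """Return the requested properties in dataset order.

        Raises KeyError if a requested property is not part of the dataset.
        """
        missing = [p for p in wanted if p not in self.properties]
        if missing:
            raise KeyError("unknown properties: " + ", ".join(missing))
        return [p for p in self.properties if p in wanted]


class VectorDataset(SpatialDataset):
    data_model = "vector"

    def __init__(self, name, properties, geometry_type):
        super().__init__(name, properties)
        self.geometry_type = geometry_type  # e.g. "Point" or "Polygon"


class RasterDataset(SpatialDataset):
    data_model = "raster"

    def __init__(self, name, properties, band_count):
        super().__init__(name, properties)
        self.band_count = band_count  # number of raster bands


if __name__ == "__main__":
    census = VectorDataset("census", ["population", "area", "name"], "Polygon")
    print(census.select(["name", "population"]))
```

The user only needs to name the subset of properties they want; the object knows whether it is a vector or a raster.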

jainamritanshu commented 7 years ago

@henrykironde @ethanwhite are there any specific GIS datasets you are planning to incorporate in the Data Retriever?

Also, as an enhancement, we could add visualization of the dataset using ggplot (a Python port of R's ggplot2).

As I was wondering about the approach to follow, I tried to make an analogy with the current working of the data retriever. Please correct me if I am wrong, or if I am missing something.

GeoJSON/TopoJSON could be used for packaging the raw shapefiles retrieved at the first stage. We could use ogr2ogr for this purpose. Once the data package is ready, we could compile it into the Python scripts as is done now.
We would also have to make some changes in the lib/tables.py file. I guess introducing a couple of new methods in the module itself would be sufficient, such as methods for indexing (mostly done with R-trees; we could use Rtree).
A couple of changes in lib/templates.py and lib/engines.py for pkformat/indexing and datatypes would suffice for the required purposes. Since the methods in lib/engines.py are nicely structured, they could be overridden in a separate engine file.
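To make the overriding idea concrete, here is a small sketch. The `Engine` class below is a stand-in I made up for illustration, not the actual class in lib/engines.py, and the method names are assumptions. The spatial subclass overrides the data type conversion and adds a PostGIS-style GiST index statement:

```python
class Engine:
    """Stand-in for a generic database engine (hypothetical, not the retriever API)."""

    datatypes = {"int": "INTEGER", "double": "DOUBLE PRECISION", "char": "TEXT"}

    def convert_data_type(self, column_type):
        return self.datatypes[column_type]

    def create_table_statement(self, table, columns):
        cols = ", ".join(
            name + " " + self.convert_data_type(column_type)
            for name, column_type in columns
        )
        return "CREATE TABLE " + table + " (" + cols + ")"


class PostgisEngine(Engine):
    """Spatial engine that overrides only what differs from the base engine."""

    def convert_data_type(self, column_type):
        # Handle the spatial type here, defer everything else to the base class.
        if column_type == "geometry":
            return "geometry"
        return super().convert_data_type(column_type)

    def create_index_statement(self, table, column):
        # PostGIS spatial indexes are built with GiST.
        return "CREATE INDEX {0}_{1}_idx ON {0} USING GIST ({1})".format(table, column)


if __name__ == "__main__":
    engine = PostgisEngine()
    print(engine.create_table_statement("roads", [("id", "int"), ("geom", "geometry")]))
    print(engine.create_index_statement("roads", "geom"))
```

With this layout the table-creation logic stays in one place, and each spatial backend only supplies its own types and index syntax.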

I am still thinking more on the approach that we could follow to tackle this. Any suggestions from your end would help me a lot.

henrykironde commented 7 years ago

@jainamritanshu, ogr2ogr is basically part of GDAL, which we shall definitely use for processing (reading and writing) the GIS datasets. Examples of these datasets are here:
https://freegisdata.rtwilson.com/
https://grass.osgeo.org/download/sample-data/

There are many file formats for both vector (including GeoJSON/TopoJSON) and raster data. However, not all are used frequently. We shall try to cover most of the commonly used formats.

We shall not package using GeoJSON because that may amount to redistribution of the data. We can use the JSON data package standard to describe the dataset's main properties, as the current tabular dataset scripts in the Data Retriever do.
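A rough sketch of what such a description might look like for a spatial dataset. The `name`, `title`, and `resources` keys follow the Data Package standard as I understand it; the `spatial` key, the function name, and the example URL are assumptions made up for this illustration. Note that the descriptor only points at the data, it does not contain it:

```python
import json


def describe_spatial_dataset(name, title, path, file_format, data_model):
    """Build a data-package-style descriptor for a spatial dataset."""
    return {
        "name": name,
        "title": title,
        "resources": [
            # The resource points at the original source; nothing is redistributed.
            {"name": name, "path": path, "format": file_format}
        ],
        # "spatial" is a hypothetical key for this sketch, not part of the standard.
        "spatial": {"data_model": data_model},
    }


if __name__ == "__main__":
    descriptor = describe_spatial_dataset(
        "roads", "Road network", "https://example.com/roads.zip", "shp", "vector"
    )
    print(json.dumps(descriptor, indent=2, sort_keys=True))
```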

> We would also have to make some changes in lib/tables.py file, I guess introducing a couple of new methods in the module itself would be sufficient, like methods for indexing(vastly done by r-tree, we could use Rtree) and more.

We shall try to use standard modules unless there is a reason to do otherwise. This helps keep the software stable.

> A couple of changes in lib/templates.py and lib/engines.py for pkformat/indexing and datatypes would suffice for the required purposes, since the methods have been nicely made in lib/engines.py, the methods could be overridden in separate engine file.

I do believe we shall have many files changed in lib, and these will be among them.

jainamritanshu commented 7 years ago

@henrykironde @ethanwhite I have prepared the first draft of my proposal. Kindly review it. Any suggestions, for example on whether the timeline is overloaded or on things I am missing, would be very helpful.

jainamritanshu commented 7 years ago

I have made a pull request in the gsoc repository to submit my proposal and I have submitted the final proposal through the GSoC portal :smile:

numfocus / gsoc

Project Idea for GSoC 2017 #170