Machine Learning based projects

NeuralMonk commented 5 years ago

Currently, our spam moderation is completely manual. Instead of manually reviewing similar content/posts, I think we can use machine learning algorithms to ease the task.

SidharthBansal commented 5 years ago

Great idea. @jywarren I want to add a couple more ideas. I know they are not core mission-driven projects, and we must focus on those before addressing these less important issues. But just to brainstorm a little:

[ ] Content Based Tag Recommendation System (Suggested by Jeff)
[ ] Anomalous Spam Detection System (as suggested by @SKashyapD)
[ ] Recommendation system for posts (@Saurabh19126848_twitter suggestion on Gitter chat)
[ ] Sentiment analysis (@Saurabh19126848_twitter suggestion on Gitter chat)
[ ] Tag suggestions by Natural Language Processing on nodes (suggested by me)

I am highly in favour of automating our services. The main problem is the lack of ML libraries for Rails. Most of the above are based on algorithms such as Isolation Forest, Naive Bayes, BBN, CNN, ANN, etc., which are widely implemented in Python but not in Rails. Writing libraries from scratch does not make sense at all, so we also need to take this into consideration.

milaaraujo commented 5 years ago

I would love to participate in any of these projects! I've worked with Recommendation Systems and Sentiment Analysis during my graduation. But I don't know any libraries for Rails tho. I've only worked with libraries in Python and R before.

SidharthBansal commented 5 years ago

Same here. I would love to work on these projects. Some are in my current semester's curriculum, but they are heavily based on Python and R.

NeuralMonk commented 5 years ago

We could make a Flask server.

ryzokuken commented 5 years ago

Hi everyone, just dropping in here to say that making a Flask server for the data science stuff is the correct approach. Essentially, you would need a separate server crunching the numbers and acting as an interface to the models. This Flask server would need to run in a separate container, and I volunteer to make the appropriate changes to the docker-compose config to make sure this floats. Looking forward to assisting people in implementing the cool features above on the website.
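A minimal sketch of the "separate server acting as an interface to the models" idea. To stay dependency-free it does not use Flask itself: the same request/response shape is shown as a plain WSGI callable from the Python standard library. The `/suggest-tags` route, the JSON field names, and the keyword-based `suggest_tags` stand-in are all assumptions made for illustration, not an existing plots2 API.

```python
import json

# Hypothetical stand-in for a trained model: a keyword -> tag lookup.
KEYWORD_TAGS = {
    "spectrometer": "spectrometry",
    "balloon": "balloon-mapping",
    "water": "water-quality",
}


def suggest_tags(text):
    """Return tags whose keyword occurs in the text (placeholder for model.predict)."""
    lowered = text.lower()
    return sorted(tag for keyword, tag in KEYWORD_TAGS.items() if keyword in lowered)


def app(environ, start_response):
    """WSGI app: POST /suggest-tags with {"body": "..."} returns {"tags": [...]}."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/suggest-tags":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"tags": suggest_tags(payload.get("body", ""))}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# To try it locally (blocks forever), something like:
#   from wsgiref.simple_server import make_server
#   make_server("", 5000, app).serve_forever()
```

The Rails app would only need to send an HTTP request to this service, so the Python model code can live in its own container or repository.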

NeuralMonk commented 5 years ago

Tag Prediction: suggest tags based on the content of posts published on the Public Lab website.

1. Real-World / Business Objectives and Constraints
   - 1.1 Predict as many labels as possible correctly.
   - 1.2 No strict latency constraint.
   - 1.3 The cost of errors would be a bad user experience.
2. Machine Learning Problem
   - 2.1 Data: training the machine learning model requires lots of data, which can be collected through the API. Data field explanation:
     - Id: unique identifier for each question
     - Title: the question's title
     - Body: the body of the question
     - Tags: the tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')
   - 2.2 Mapping the real-world problem to a machine learning problem
- 2.2.1 Type of machine learning problem: it is a multilabel classification problem. Multilabel classification assigns to each sample a set of target labels. This can be thought of as predicting properties of a data point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time, or none of these. Credit: http://scikit-learn.org/stable/modules/multiclass.html
- 2.2.2 Performance metric: micro-averaged F1 score (mean F score). The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula is: F1 = 2 * (precision * recall) / (precision + recall). In the multi-class and multi-label case, the per-class results are combined according to the averaging mode:
  - 'micro f1 score': calculate metrics globally by counting the total true positives, false negatives and false positives.
  - 'macro f1 score': calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
  
  2.2.3 Machine Learning Objectives and Constraints
  1. Maximize the micro-averaged F1 score.
  2. Try out multiple strategies for multi-label classification.
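To make the micro/macro distinction concrete, here is a small dependency-free sketch that computes both averages from lists of tag sets. The example tags and predictions are made up; in practice a library routine (for example scikit-learn's `f1_score` with its `average` parameter) would normally be used instead.

```python
def f1(tp, fp, fn):
    """F1 = 2 * P * R / (P + R), written in terms of counts; 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """y_true / y_pred are lists of tag sets, one set per post."""
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for true_tags, pred_tags in zip(y_true, y_pred):
        for label in labels:
            if label in pred_tags and label in true_tags:
                counts[label][0] += 1
            elif label in pred_tags:
                counts[label][1] += 1
            elif label in true_tags:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


# A frequent tag ("water") predicted well, a rare tag ("air") never predicted.
y_true = [{"water"}, {"water", "air"}, {"water"}, {"air"}]
y_pred = [{"water"}, {"water"}, {"water"}, {"water"}]
micro, macro = micro_macro_f1(y_true, y_pred, ["water", "air"])
print(round(micro, 3), round(macro, 3))
```

The macro score drops much further than the micro score here, because the completely missed rare tag counts as much as the frequent one.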

3. Exploratory Data Analysis
   - 3.1 Using Pandas with SQLite to load the data
   - 3.2 Analysis of tags
   - 3.3 Cleaning and preprocessing:
     - Sample data points
     - Separate code from body
     - Remove special characters from the question title and description
     - Remove stop words
     - Remove HTML tags
     - Convert all characters to lowercase
     - Use SnowballStemmer to stem the words
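The cleaning steps in 3.3 could look roughly like the sketch below, using only the standard library. The stop-word list is a tiny illustrative assumption, the `<code>` tag convention is assumed, and the stemming step (SnowballStemmer comes from NLTK) is deliberately left out to keep the example dependency-free.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "how", "i"}


def preprocess(title, body):
    """Clean a post: drop code, strip HTML, lowercase, remove special chars and stop words."""
    text = f"{title} {body}"
    text = re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)  # separate code from body
    text = re.sub(r"<[^>]+>", " ", text)                            # remove remaining HTML tags
    text = html.unescape(text).lower()                              # decode entities, lowercase
    text = re.sub(r"[^a-z0-9 ]+", " ", text)                        # remove special characters
    return [word for word in text.split() if word not in STOP_WORDS]


print(preprocess("How to calibrate a spectrometer?",
                 "<p>I used <code>calibrate(x)</code> &amp; the CFL bulb.</p>"))
```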

4. Machine Learning Models
   - 4.1 Converting tags for the multilabel problem
   - 4.2 Split the data into train and test sets (80:20)
   - 4.3 Featurizing data with a TF-IDF vectorizer
   - 4.4 Applying Logistic Regression/SVM with a OneVsRest classifier

5. Testing the model
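Steps 4.1 and 4.2 can be illustrated without any ML library: each post's tag set becomes a row of 0/1 values (one column per tag), and the rows are shuffled and split 80:20. This is only a sketch with made-up tags; with scikit-learn one would more likely reach for `MultiLabelBinarizer` and `train_test_split`, followed by `TfidfVectorizer` and `OneVsRestClassifier` for steps 4.3 and 4.4.

```python
import random


def multi_hot(tag_sets):
    """Turn a list of tag sets into (sorted label list, list of 0/1 rows)."""
    labels = sorted(set().union(*tag_sets))
    rows = [[1 if label in tags else 0 for label in labels] for tags in tag_sets]
    return labels, rows


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


tags = [{"water", "sensor"}, {"air"}, {"water"}, {"air", "sensor"}, {"mapping"}]
labels, rows = multi_hot(tags)
train, test = train_test_split(list(zip(tags, rows)))
print(labels)   # ['air', 'mapping', 'sensor', 'water']
print(rows[0])  # [0, 0, 1, 1]
print(len(train), len(test))
```

With this encoding, a one-vs-rest strategy simply trains one binary classifier per column.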

Sources / useful links
- YouTube: https://youtu.be/nNDqbUhtIRg
- Research paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
- Research paper: https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL

SidharthBansal commented 5 years ago

I really love your research, but it's important to get input from @jywarren on whether or not the organisation is aiming to bring ML into current projects. Sooner or later we will need to enable ML, but it depends on core mission projects too. So, Jeff can best guide us on whether these should be discussed further or taken up later. Thanks everyone.

NeuralMonk commented 5 years ago

Hello everyone! Please let me know if I should start working on it, since it will take a lot of time commitment and effort on my part. Or, if you want me to work on something else, please let me know.

jywarren commented 5 years ago

Hi, thanks to everyone for your input here! I think there are some potential use cases for machine learning across the Public Lab ecosystem! But perhaps we need to do a bit more in-detail brainstorming on individual examples. For example, I'm not sure that running a containerized flask server as part of the plots2 codebase makes sense because it dramatically expands the setup complexity of the project (we had an issue with this in a previous project to run a Solr container), but perhaps it could make sense to develop in a separate repository?

Could such a separate server for data analysis access data via the API?

Of the brainstormed applications, I'm hesitant on the spam one -- I like the basic premise, but to me, it seems more sustainable and less 'reinvent the wheel' to look at an existing library or service for spam identification, like Akismet or something. I'm sure others have worked on this problem and am less sure we could provide something unique that would be competitive.

On the other hand, I'd love to think about places in the PL ecosystem where machine learning would present a really unique benefit that supports our overall mission.

Would Spectral Workbench be one of those places?

I note a mention of neural networks for trying to solve an issue here: https://github.com/publiclab/spectral-workbench.js/issues/56#issuecomment-457179753 (although seems that should be broken into its own issue)
@Lucaszw emailed me some time back with the idea of using machine learning to apply appropriate tags to spectra in SpectralWorkbench. That also seems interesting!

On MapKnitter, would it be plausible to scan images and try to identify features and tag accordingly?

The Vision API at Google Cloud can do some pretty interesting things there: https://cloud.google.com/vision/

Although in this test it didn't seem to find anything in this aerial photo except that it was an aerial photo 😄 :

Perhaps one approach here might be to begin a Zooniverse project using MapKnitter data: https://www.zooniverse.org/lab

Then that could be used as training data to develop a machine learning approach to identifying, say, areas of high risk of spills, pollution, etc.

Terrapattern tried doing something kind of like this: https://qz.com/764746/terrapattern-open-source-satellite-photo-search-tool/

http://www.terrapattern.com/about

That could be a really interesting approach, and I like the idea of using the MapKnitter image set to help an ML approach get better at identifying pollution.

Note that Terrapattern also uses OpenStreetMap tags to train its model. Perhaps we could correlate MapKnitter images with any OSM tags which are overlapping with the images shown, although there might not be too many.

Anyhow, these are some ideas that get a bit at the environmental mission of Public Lab, and might make for an interesting set of possible projects that wouldn't necessarily live IN the plots2 codebase, but could be really powerful tools for our community.

jywarren commented 5 years ago

This is a really great example of using machine learning to identify environmental issues: https://skytruth.org/2019/02/using-machine-learning-to-map-the-footprint-of-fracking-in-central-appalachia/

It also gets at some of the challenges, and discusses how to use existing manually categorized datasets as a training set, OR to use existing databases to correlate with imagery to train a model. Great work, @skytruth!

NeuralMonk commented 5 years ago

Hey everyone, and thanks @jywarren for your wonderful input; your proposed ideas are very cool and interesting. I have already started reading and researching them. It will take me about a week to find out how things are supposed to be done. Thanks, everyone.

NeuralMonk commented 5 years ago

Hey everyone, I have done my research on the given ideas and devised the following plan. @jywarren, it is definitely a good idea to create a new repository for machine learning based projects, with a separate server for data analysis that accesses data via the API.

We can host a Flask server that works like this:

1. Take a screenshot of the image,
2. Feed it to the input of the model,
3. Show the model's output on the web page.

Goal: Automatically label aerial imagery

Tagging,
Semantic segmentation.

Implementing the machine learning model in simple steps:

Collect pairs of images and labels,
Write a program that predicts labels for given images (model),
Let the computer automatically tune parameters to mimic the examples (learning).

The lengthy task: collecting pairs of aerial images and labels

One important yet rarely discussed aspect of using machine learning for aerial image interpretation is the source of the data. Since labelling images is a very time-consuming process, the datasets have been small in both aerial image applications and general image labelling work. Hence, obtaining good sources of accurately labelled data is important for both evaluating existing approaches and training systems that are likely to work under varying conditions. In some domains, hand-labelling data in order to train a classifier is not necessary because the label information is often readily available. For example, in the case of road detection (Semantic segmentation), the locations of existing roads are typically known because they are useful for navigation and not just as target labels in a machine learning task. The abundance of accurately labelled data for road detection makes it a very good candidate for evaluating existing aerial image interpretation systems as well as the application of machine learning techniques.

For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. This type of data can act as a source of noisy labels, which are correct with very high probability when they indicate the presence of an object and with lower, but still substantially high, probability when they indicate the absence of an object. Training a classifier on large amounts of this type of noisy data with a robust loss function can potentially produce a much better detector than using a much smaller set of accurate labels. At present, there seem to be no applications of robust estimators to aerial image data with noisy labels. For object classes such as cars, or areas for which Google Maps possesses neither accurate nor complete map information, hand-labelling data seems to be the only option, or we can use crowdsourcing tools like Zooniverse (https://www.zooniverse.org/) to help build the dataset.

In a classification task, small translations or rotations can be applied to the input images, but in order to apply the same idea to image labelling one must be able to realistically transform both the image and the labels. On a road detection task, applying rotations to each training case before it is processed has been shown to help prevent overfitting.

So we need to start making our own dataset for better results. We can do it manually, and I'd like to volunteer to do this using a Python script. Alternatively, platforms like Zooniverse can be used to create the dataset: https://help.zooniverse.org/getting-started/

The most important part is the data. A larger and more accurately labelled sample will lead to better results. The primary obstacle is the imbalance in the dataset, which makes detecting rare labels a difficult task.

Tagging

This is almost the same task as the one I suggested earlier for text. The difference is that the dataset now consists of images, so we need to use a CNN. There is a great blog post by Adit on how CNNs actually work for image classification: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

The machine learning model

Residual Network (ResNet) is a major breakthrough in CNNs:

1. It allows training models with hundreds of layers for greater accuracy.
2. Layers compute the residual (delta) between input and output.

Why does it work?
Each layer has less work to do (no copying).
Gradients flow more easily thanks to the skip connections.

To understand this more deeply, you can go through a great, intuitive blog post: https://wiseodd.github.io/techblog/2016/10/13/residual-net/
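The "no copying" point can be seen in a toy residual block written in plain Python. This is a simplified sketch (two small linear layers with a ReLU in between, no convolutions, no activation after the addition), not the actual ResNet architecture; the weights and input are made up.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Matrix-vector product: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """y = x + F(x), where F(x) = W2 * relu(W1 * x) is the learned residual."""
    residual = linear(w2, relu(linear(w1, x)))
    return [xi + ri for xi, ri in zip(x, residual)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights the block is exactly the identity: the input passes
# through the skip connection without the layers having to learn to copy it.
print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]
```

A plain (non-residual) stack with the same zero weights would output all zeros, so it would have to learn the identity mapping before it could learn anything useful on top of it.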

Our approach to making our model better

1. Instead of softmax, use the sigmoid activation function, so that each tag gets its own independent probability (an image can have several tags).

2. Optimize the tag threshold to maximize the F2 score.

Often we try to find the optimal threshold for the F2 score by trial and error. Instead, a brute-force search for the best threshold on a local validation set can net really good results on the leaderboard (LB), without much overfitting to the local score. Basically, you try every candidate threshold on a local validation set, take the best-performing one, and apply it to the test set. We also know that the best threshold is vastly different for each class, so we can get a further big improvement by setting a different threshold for each class.
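A sketch of the per-class brute-force threshold search described above, in plain Python. The validation labels, the sigmoid outputs, and the candidate grid are made-up assumptions; the F-beta formula itself is the standard one, with beta = 2 weighting recall more heavily than precision.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta for one label; beta=2 weights recall higher than precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def best_thresholds(y_true, probs, candidates):
    """For each label column, pick the candidate threshold with the highest F2."""
    n_labels = len(y_true[0])
    chosen = []
    for j in range(n_labels):
        truth = [row[j] for row in y_true]
        scores = [row[j] for row in probs]
        chosen.append(max(candidates,
                          key=lambda t: fbeta(truth, [s >= t for s in scores])))
    return chosen


# Validation set: 4 images, 2 labels; probs are per-label sigmoid outputs.
y_val = [[1, 0], [1, 1], [0, 0], [1, 0]]
p_val = [[0.9, 0.2], [0.4, 0.7], [0.3, 0.1], [0.6, 0.6]]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
print(best_thresholds(y_val, p_val, candidates))  # [0.4, 0.7]
```

Note that the two labels end up with different thresholds, which is the effect the paragraph above relies on. With so little validation data the chosen values would of course overfit; the grid and the data size here are only for illustration.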
Using a pretrained model

A very common trick in ML is transfer learning: instead of training your model from a random initialization, you initialize it with the parameters of a similar model that has already been trained on a different dataset, which is basically a great head start.

Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch to solve a similar problem, you use the model trained on the other problem as a starting point.

For example, suppose you want to build a self-driving car. You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was built on ImageNet data, to identify the objects in those pictures.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to re-invent the wheel.

Augment the labelled dataset using lossless image transformations.

The more data the better, so we can, for example, rotate each image by 90 degrees left and right, which increases the size of our dataset.
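A small sketch of lossless rotation augmentation on toy data. For whole-image tagging the tags stay the same after rotation; for semantic segmentation the label mask has to be rotated together with the image, which is what the sketch shows. The 2x2 "image" and mask are made up for illustration.

```python
def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise (lossless: no interpolation)."""
    return [list(row) for row in zip(*grid[::-1])]


def augment(image, mask):
    """Return the original pair plus its 90, 180, and 270 degree rotations."""
    pairs = [(image, mask)]
    for _ in range(3):
        image, mask = rotate90(image), rotate90(mask)
        pairs.append((image, mask))
    return pairs


image = [[10, 20], [30, 40]]   # toy 2x2 "pixels"
mask = [[1, 0], [0, 0]]        # label mask: the pixel with value 10 is, say, "road"
for img, msk in augment(image, mask):
    print(img, msk)
```

In every rotated pair the 1 in the mask stays on top of the pixel with value 10, so the labels remain valid for the transformed image.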

Tune the learning rate (LR) manually. It is very important to find which LR performs best.
Ensembling of 3 model architectures (optional):
1. ResNet 5x
2. Inception 5x
3. DenseNet 5x

Alternatively, we can do well with a single "ConvNet101"; it depends on the resources we have. Ensembling is a good ML approach, but it gives only a small boost in F2 score and takes about 15 times more computation than ConvNet101.

Semantic segmentation

Basically, "semantic segmentation" attempts to partition the image into semantically meaningful parts, and to classify each part into one of the pre-determined classes. You can also achieve the same goal by classifying each pixel (rather than the entire image/segment). In that case, you are doing pixel-wise classification, which leads to the same end result via a slightly different path. To understand it more deeply, check this very insightful blog post: https://www.jeremyjordan.me/semantic-segmentation/

ResNet-based FCN architecture
Fine-tune a pre-trained model
Use IR-R-G images as input
Make predictions using a sliding window, because the network can only handle 256x256 inputs
Ensemble by averaging five models
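The sliding-window step can be sketched as a generator of 256x256 boxes that together cover a larger image. The 128-pixel stride (50% overlap) is an assumption, not something stated above; where windows overlap, the per-pixel predictions would typically be averaged.

```python
def window_origins(size, window=256, stride=128):
    """Start offsets along one axis so that windows cover [0, size)."""
    if size <= window:
        return [0]
    origins = list(range(0, size - window, stride))
    origins.append(size - window)  # final window sits flush with the edge
    return origins


def sliding_windows(height, width, window=256, stride=128):
    """Yield (top, left, bottom, right) boxes covering the whole image."""
    for top in window_origins(height, window, stride):
        for left in window_origins(width, window, stride):
            yield top, left, min(top + window, height), min(left + window, width)


boxes = list(sliding_windows(512, 600))
print(len(boxes), boxes[0], boxes[-1])  # 12 (0, 0, 256, 256) (256, 344, 512, 600)
```

Each box would be cropped from the image, passed through the network, and written back into a full-size prediction map.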

Other ideas for future works.

1. Detection of oil spills

Detecting oil spills accurately using a CNN is a very tough task, because some natural phenomena look similar from space and a small sample size does not help. We need SAR images to detect oil spills correctly, because in SAR images an oil spill appears as a dark formation, which can be detected easily. The following can prove useful:

Fully convolutional network (FCN)
FCN-GoogleNet
FCN-ResNets
Deep neural autoencoder

2. Detection and mapping of plastic

We can detect plastic with our trained model using object detection. While labelling the data, we need to add a specific plastic / no-plastic label, so that the CNN can learn from thousands of examples of labelled plastic pieces and finally tell what is a piece of plastic and what is not. We could also detect different types of plastic objects, like rope, toys, etc.

3. Air pollution

When somebody uploads a geotagged image to MapKnitter, we can look up the PM2.5 level and the air quality for that location using the following link: https://aqicn.org/map/india/#@g/19.9884/80.5078/5z so that we can classify whether the air in the given region is polluted or not.

But predicting future air pollution patterns is itself a major machine learning task.

PM2.5 refers to the tiny particles in the air that reduce visibility and cause the air to appear hazy. PM2.5 levels are affected by meteorological and traffic factors and by the burning of fossil fuels, and industrial factors such as power plant emissions play a significant role in air pollution.

The required data-set

Temperature
wind speed
Dewpoint
pressure
PM2.5 Concentration
classified data sample(polluted or not)

Our system does two tasks:

1. Detect the PM2.5 level at a given location.
2. Predict the PM2.5 value for a particular date:
   - 2.1 Logistic regression to predict whether the air is polluted or not.
   - 2.2 Autoregression to predict a future PM2.5 value from previous PM2.5 readings.
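The autoregression step (2.2) can be illustrated with the simplest case, an AR(1) model fitted by least squares in plain Python. The readings are made up, and the fixed cut-off used to flag "polluted" is a hypothetical stand-in for the logistic regression step (2.1), not an official air quality standard.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    return mean_curr - phi * mean_prev, phi


def forecast(series, steps):
    """Roll the fitted AR(1) model forward `steps` readings."""
    c, phi = fit_ar1(series)
    value, out = series[-1], []
    for _ in range(steps):
        value = c + phi * value
        out.append(value)
    return out


# Made-up daily PM2.5 readings, generated by x[t] = 10 + 0.8 * x[t-1].
readings = [100.0, 90.0, 82.0, 75.6, 70.48]
POLLUTED_ABOVE = 60.0  # hypothetical cut-off, stands in for the classifier
for value in forecast(readings, 3):
    print(round(value, 2), "polluted" if value > POLLUTED_ABOVE else "ok")
```

A real model would use more lags and the meteorological features listed above (temperature, wind speed, dew point, pressure) as additional inputs.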

Since our plan is quite extensive, I'd like to begin working on it as soon as possible. I'd welcome your input on this; primarily, should I start the project on Zooniverse, or should I start labelling manually?

thanks, everyone

jywarren commented 5 years ago

Hi! This is a lot of information - thanks for compiling it! I wanted to ask a few things first --

With such a complex system, perhaps we should do some diagramming to show what the parts of the system are, and what are the potential ways to fulfill each part -- we could start with a diagram template like the one linked here, that was used to generate the plots2 data model: https://github.com/publiclab/plots2/blob/master/doc/DATA_MODEL.md
I'm really interested in good integration with existing efforts -- what portions of systems like Terrapattern and others are re-usable, or could we at least remain compatible with? https://github.com/CreativeInquiry/terrapattern
For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. -- I'd even prefer OpenStreetMap, which Terrapattern uses, and is an open source data source which we could also encourage people to contribute to in order to improve the training! See how to query here: https://github.com/publiclab/leaflet-environmental-layers/issues/50 and also a lot about more data sources to draw from in https://github.com/publiclab/leaflet-environmental-layers/ !
For the PM air quality data, do you think perhaps it's possible that there is no visible sign of air quality issues in MapKnitter images? or if you're not using images to correlate, but just data, there may be other models to look to first.

I hope this helps!

jywarren commented 5 years ago

Oh, and also, starting a Zooniverse project would be GREAT! @zengirl2 may be interested in this too.

NeuralMonk commented 5 years ago

Thanks @jywarren for the great input and for making things clearer and more interesting.

Yes, it is a little complex. I will try to break things down in a simpler way; I have started working on this and will try to complete it as soon as possible.
For now we can do the semantic segmentation part, which can help the model predict tags like ROAD, BUILDING, WATER, TREES, VEGETATION, because there is freely available data, e.g. https://project.inria.fr/aerialimagelabeling/ and we can use OpenStreetMap http://openstreetmapdata.com/ so we can start doing this.
Using open source is always fun.
Using images alone we can only find out whether or not the image is hazy, but with the location of the image we can look up the PM2.5 value for that particular location.
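To show how a segmentation result could turn into tags, here is a minimal sketch. It assumes the model outputs a 2D mask of integer class ids; the id-to-name mapping, the `tags_from_mask` name and the 5% threshold are all hypothetical choices, not something taken from an existing model.

```python
from collections import Counter

# Hypothetical mapping from class id to tag name.
CLASS_NAMES = {0: "ROAD", 1: "BUILDING", 2: "WATER", 3: "TREES", 4: "VEGETATION"}


def tags_from_mask(mask, min_fraction=0.05):
    """Return the tags whose class covers at least min_fraction of the pixels."""
    counts = Counter(pixel for row in mask for pixel in row)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        CLASS_NAMES[class_id]
        for class_id, n in counts.items()
        if class_id in CLASS_NAMES and n / total >= min_fraction
    )


# Tiny made-up mask: mostly road (0) and water (2), one building pixel (1).
mask = [
    [0, 0, 2, 2],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]
tags = tags_from_mask(mask, min_fraction=0.1)
```

The threshold keeps a handful of stray pixels from producing a tag; the right value would depend on image resolution and how noisy the model is.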

NeuralMonk commented 5 years ago

Zooniverse sounds great! I guess you should create a team first and add me (and @zengirl2 or anyone else who is interested too) and I could then flesh out the rest of the project.

Hope this sounds good?

jywarren commented 5 years ago

oh very cool, yes that sounds good! Can you email me with your email or Zooniverse username at jeff@publiclab.org?


Zengirl2 commented 5 years ago

@SKashyapD Hey there--I do have a strong interest in Zooniverse, but I'm still behind on a fan project I'm working on. So, you can include me, but I won't be able to do much right now.

NeuralMonk commented 5 years ago

untitled diagram

This is the simplest way to show how things are going to work; each block has its own technical details. Please create a repository and I will explain every technical detail there.

Thanks @jywarren for creating the Zooniverse project. It looks great. I started working on it, but I need to know a few things first to make it better and clearer:
- What are we specifically looking for (core mission)?
- What labels are we going to use to create our database?
- Is there anything important you want to mention?

Should I start working on the semantic segmentation part?

thanks everyone

NeuralMonk commented 5 years ago

Thanks @Zengirl2 for showing interest. Any kind of contribution will be great. @jywarren please add @Zengirl2 to our Zooniverse project.

Zengirl2 commented 5 years ago

@SKashyapD I originally had interest in using Zooniverse to go through possible pollution from hurricanes. They have started to do projects for hurricanes (although not with the pollution I would like). I was at the point of having conversations with two people from Zooniverse about learning to use their content system. I believe I may even have a video tutorial that they sent me.

NeuralMonk commented 5 years ago

I am really excited to complete the Zooniverse project and the semantic segmentation part. @jywarren please give me some input so that I can start working; I will try to complete all this as soon as possible. @Zengirl2 please share that tutorial video, it will help me a lot.

Zengirl2 commented 5 years ago

@SKashyapD Here's the links for some helpful info about setting up projects on Zooniverse (this was based on a specific example of flood/hurricane I had been asking about).

Doc Explanation https://docs.google.com/document/d/1W5y5Iq6WY5OpP6P4kcHrE6od0tGBFhO0huXvXHJJCzs/edit?usp=sharing

Youtube video https://www.youtube.com/watch?v=_bcu5tJDjPY

NeuralMonk commented 5 years ago

Thanks @Zengirl2 for providing the resources. @jywarren please let me know when you are finished. I am already working on some prerequisites that will help us in the future.

jywarren commented 5 years ago

I think @Zengirl2's idea for core mission is great -- identify specific types of pollution from aerial photos -- and we can start with whatever is a good initial training set.

I added @Zengirl2 to the zooniverse! Thank you!

jywarren commented 5 years ago

There are lots of Hurricane Harvey images linked to from posts on this page: https://publiclab.org/wiki/harvey#Questions -- i hope that helps!

NeuralMonk commented 5 years ago

@jywarren That sounds amazing. I will start working on this immediately. Should I make a summer of code proposal for image labelling in MapKnitter using semantic segmentation and tagging?

jywarren commented 5 years ago

Sure, if you're interested in submitting a proposal, that would be great!


NeuralMonk commented 5 years ago

hello everyone

1. @jywarren I have done some work on our Zooniverse project and uploaded a random aerial picture. Check our workflow, it looks great: Screenshot from 2019-03-22 00-47-36
2. We need to upload some data to start classifying. Making a CSV file with longitude and latitude information would be great.
3. After uploading the data we need to share our project link as much as possible. Announcing it on the Public Lab website would be a great start.
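For the CSV in point 2, a minimal sketch of building such a file could look like this. The column names (`filename`, `latitude`, `longitude`), the `build_manifest` helper and the sample row are assumptions for illustration; the exact columns Zooniverse expects should be checked against its own documentation.

```python
import csv
import io


def build_manifest(records):
    """Build CSV text from (filename, latitude, longitude) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["filename", "latitude", "longitude"])
    for filename, lat, lon in records:
        # Reject coordinates outside the valid geographic range.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"bad coordinates for {filename}: {lat}, {lon}")
        writer.writerow([filename, f"{lat:.6f}", f"{lon:.6f}"])
    return buffer.getvalue()


# Hypothetical tile name with approximate Houston coordinates.
manifest_text = build_manifest([("tile_001.jpg", 29.7604, -95.3698)])
```

Writing `manifest_text` to a file next to the images gives one row per image, so each classification can later be traced back to a location.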

There is still a lot of work to be done, and I am trying to do it as soon as possible while maintaining content quality.

@Zengirl2 please review it; suggesting a few tags would be great.

Thanks @jywarren, I am working on the proposal. I have also contacted a main contributor of Terrapattern, who provided some very useful links, like the OpenStreetMap data set they used. We discussed the technical approaches they tried and used, and it is really helping me see our project more clearly.

cheers! thanks everyone

NeuralMonk commented 5 years ago

@Zengirl2 do you have any idea how many images we can classify in 2-3 months?

@jywarren if we are able to classify enough images on Zooniverse, then after building a neural network model on OpenStreetMap data we can add our Zooniverse data to classify environmental issues, which is known as batch processing in machine learning.

jywarren commented 5 years ago

This is very interesting! Would you be interested in trying to use some of the imagery from Hurricane Harvey that was posted in the link I shared? Cool!


NeuralMonk commented 5 years ago

I downloaded the set of images from this link. Should I upload the whole data set or just a few images? @jywarren

Zengirl2 commented 5 years ago

@SKashyapD This is awesome! I was going to suggest the same as @jywarren about images. Can you explain exactly what you mean by tags in this case? Do you mean tags for people to find the info or tags of more examples of pollution or other indicators that interest us? Also, as far as how many images we can classify--do you mean when people look at them on Zooniverse to mark what they find or do you mean some process before the images are loaded into Zooniverse? Sorry, I'm really an Arduino hardware person that understood Zooniverse could be a possibility, but didn't really have the programming knowledge to make it happen. You are really bringing a dream of mine to life! :joy:

NeuralMonk commented 5 years ago

Thanks @Zengirl2 for such a wonderful reply.
1. The images are too big, which will make the tagging task difficult. Should I crop them first, or is there another way to handle this?
2. Yes, tags for types of pollution and indicators, or anything else you find useful.
3. Actually, I wanted to know what kind of response we can expect from volunteers.
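On cropping, one option is to slice each large image into fixed-size tiles. Here is a minimal sketch that only computes the crop boxes, assuming a 512-pixel tile size (an arbitrary choice) and a hypothetical `tile_boxes` helper. The boxes are (left, upper, right, lower) tuples, the same order Pillow's `Image.crop` takes, so they could be passed to it directly.

```python
def tile_boxes(width, height, tile=512):
    """Return (left, upper, right, lower) boxes that cover the whole image."""
    if tile <= 0:
        raise ValueError("tile size must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Tiles on the right and bottom edges are clipped to the image size.
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


# A 1000 x 600 image gives four tiles, two of them narrower and two shorter.
boxes = tile_boxes(1000, 600)
```

Edge tiles come out smaller than the rest; whether to keep, pad or drop them is a judgment call that depends on how much useful content sits at the borders.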

thanks everyone

NeuralMonk commented 5 years ago

hello everyone

I uploaded 147 images from Hurricane Harvey after slicing them and removing irrelevant ones. I will try to add more images soon; @Zengirl2 please review it. @jywarren can I add something to the research section? And to update the Team section, can you please provide your portfolio links @jywarren @Zengirl2? That will make our project look more promising.

thanks

Zengirl2 commented 5 years ago

@SKashyapD Where can I view the work on the Zooniverse project (besides the screen grab earlier)? Would I have received an invite? Also, I think if I remember correctly Zooniverse can include our project on Zooniverse (rather than just a public project)--that's how you get a lot of action on it. Have you decided which way you are going to categorize it?

Zengirl2 commented 5 years ago

@SKashyapD I sent a note to my contacts at Zooniverse letting them know you are working on a project. Also, my user name on Zooniverse is @Zengirl2 as well. :)

jywarren commented 5 years ago

I think I sent you an invite, @zengirl2!


NeuralMonk commented 5 years ago

@jywarren I have sent my summer of code proposal to your email and would like to hear your feedback, since we are pressed for time.

thank you.

NeuralMonk commented 5 years ago

@Zengirl2 to edit the project you can go through this link too. What are the criteria to get selected as a Zooniverse project? Which type of categorization are you talking about? Categorization of the dataset?

Zengirl2 commented 5 years ago

@jywarren and @SKashyapD - when I log into Zooniverse it is not showing that I'm connected to any projects. Jeff, I remember seeing where you said you were going to invite me, but I don't remember getting any email about it. Can you see what name you used to add me?

@SKashyapD what I was talking about as far as whether this is a Zooniverse project or private project is listed under lab policies.

NeuralMonk commented 5 years ago

Please check your email; you may have received the invitation, because your username is the same in the project @Zengirl2.

NeuralMonk commented 5 years ago

As for categorization of the project, @jywarren may be better placed to say. Do we have enough volunteers for the classification task?

Zengirl2 commented 5 years ago

@SKashyapD Hey, just got the email today. Will look at the project tonight when I get home :unicorn:

Zengirl2 commented 5 years ago

@SKashyapD I had a chance to look at the project and it is coming along fine. I noticed that when I chose to mark an image, that it did not give me another image once I had completed. Was this because it is not yet live? Or have you not attached a file of images yet? Anyway, here's my comments:

If this is just a test, it is fine that it is not a full blown Zooniverse project. Just sending the link to the Public Lab community once this is live is good.
Usually a Zooniverse project only takes on marking an image for one or two things. We are asking more by having many types of pollution. I know just trying to identify oil sheen from an image is difficult, so we probably need to develop a tutorial. Also, a gas company flare--would that be considered pollution? These are some of the things a tutorial can make more understandable :). In fact, the original image you used as an example earlier before you sent the link for the project was great--perhaps that can be used for the tutorial.
We should probably make it more clear why we are trying to do this work, so maybe filling out the field guide section would be a good idea as well.

NeuralMonk commented 5 years ago

@Zengirl2 I have fixed that problem and it is working properly now. Please take a look at it again.

I will make a tutorial as soon as possible, and I will add more images too. Can you provide an example tutorial, or anything else that can help make the tutorial better?
I did some research on why we are doing this while writing my summer of code proposal, so can I add a few things @jywarren?
Can you provide me with your bio or something that can help me create the Team section @jywarren @Zengirl2? It will help make our project look good.

Thank you!

Zengirl2 commented 5 years ago

Hey @SKashyapD--your images are working correctly now :tada:

Great example of a similar project and tutorial (it has already completed but you can still view it)
Tutorial Details - I know some important things we were talking about identifying was sheen on water from oil spills, damaged infrastructure (like large oil tanks that get ripped open or toppled from hurricanes), flares (the flames from stacks from gas companies) and I'm wondering if we can identify tar on beaches? Maybe that counts as oil spill, too.
Drawing "Mining" - I was having difficulties using this--do you need to make more than two points? It said "2 of 0 required drawn" when I tried it.
Classification section - This seems to be a summary of the places identified by the symbols/drawings, but not sure where/how I'm supposed to input any information (like for instance if I knew there was a gas plant in a location).
Pretty Stuff - The hurricane project example I gave you earlier helps to show how to make a project attractive/needed. I'm thinking we may be able to get a photo for the front page that looks more like hurricane devastation. I believe we have images already on Public Lab's site that could be useful, so I'll try to find one. This also affects the message on the top of the project...maybe something like "We need your help recognizing pollution from aerial images so we can prepare for future disasters". Also, where you have the quote about "destroying oceans" maybe we can give more detail about how hurricanes and other disasters cause pollution of air, water and soil for living things in surrounding areas long after the initial event. Also, the ability to identify pollution from aerial images helps to hold companies accountable for preparation and remediation. Think Skytruth :)
My bio (you can use my pic from Github--let me know if you need it larger)- Leslie is a user and educator of open source hardware and volunteers with Public Lab to help others investigate their environmental concerns. She is currently working on a Master's of Environmental Studies with a focus on Conservation Tech at University of Pennsylvania.

skilfullycurled commented 5 years ago

I saw machine learning and I wanted to chime in. Of the original list that @SidharthBansal compiled from the different sources of requests, I wanted to add that we had been discussing the tag recommendation tangentially on the website (the code part of the conversation has moved to GitHub). At any rate, to @jywarren's comment above regarding not 'reinventing the wheel', there are some recommendation engines in Ruby that I recommended (har har) in my comment here.

NeuralMonk commented 5 years ago

Sorry for the delay @Zengirl2

I started working on the tutorial, and thanks for the resources.
For drawing mining I selected polygon because it will help us map mining areas better (you can draw any required shape).
In the classification section I will add an extra section for notes like this.
I will update a few sections of the project to make it more appealing.
Thanks for the bio @Zengirl2.

Can you please provide a few resources for more images @jywarren @Zengirl2?

Thanks everyone!

NeuralMonk commented 5 years ago

Thanks @skilfullycurled for taking the initiative. You can check this, it may help: recommendify

publiclab / plots2

Machine Learning based projects #4660