openai / requests-for-research

A living collection of deep learning problems
https://openai.com/requests-for-research

Add Description2Code to request-for-research #5

Closed ethancaballero closed 8 years ago

ethancaballero commented 8 years ago

Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but solvable/impactful. [this request has since been altered as described below]

ilyasu123 commented 8 years ago

At present the problem is too hard, and success is impossible. Can you narrow the problem down? For example, can you find a similar problem where the program descriptions are only a couple of sentences long and the programs do not exceed 20 lines of code? Using GitXiv as a source of training data is cool, but it is too hard, even for a sequence-to-sequence model with attention.

ilyasu123 commented 8 years ago

GitXiv as a source of training data for paper2code is too hard for 2016's machine learning. Try to find a source of programs that do not exceed 20 lines in length. Otherwise the project is pretty much guaranteed to be unsolvable barring significant breakthroughs in ML.

ethancaballero commented 8 years ago

I've altered the research request file to meet the specifications you outlined in your previous two responses. I'm currently working on scraping a dataset of short program descriptions and short (~20 line) programs. I'll probably have the dataset by the end of the week. Also, you might be able to ask Sam Altman to get one of YCombinator's companies (such as Triplebyte/HackerRank) to allow access to its (~20 line) description-code pair (programming interview) dataset(s).

ilyasu123 commented 8 years ago

I think the most important task is the preparation of the dataset (which will become a valuable resource in the community). For the purposes of this research request, do not assume that we will get TripleByte's question set. You may be right that the best way to get such a dataset is to partner with an organization that is involved in programming education or testing. But if there is a way to get such data online, that would be great.

BTW, I suggest that you paste a few example programs here before investing a big effort in scraping the data. I may have useful feedback.

Once you have the dataset, simply ask the user to apply a sequence-to-sequence model with attention to this input-output example set, and to experiment with all kinds of architectural designs that might improve performance.
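
For concreteness, here is a minimal sketch of such a model: a character-level encoder-decoder with additive attention. PyTorch is my assumption here purely for illustration; the suggestion itself is framework-agnostic.

import torch
import torch.nn as nn

class Seq2SeqAttn(nn.Module):
    """Encoder-decoder over character tokens with additive attention."""
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(3 * hidden, hidden)  # input: embedding + context
        self.attn_q = nn.Linear(hidden, hidden)        # query projection
        self.attn_k = nn.Linear(2 * hidden, hidden)    # key projection
        self.attn_v = nn.Linear(hidden, 1)             # score projection
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # src: (B, S) description tokens; tgt: (B, T) gold solution (teacher forcing)
        enc, _ = self.encoder(self.embed(src))                    # (B, S, 2H)
        h = enc.new_zeros(src.size(0), self.out.in_features)      # decoder state (B, H)
        logits = []
        for t in range(tgt.size(1)):
            scores = self.attn_v(torch.tanh(
                self.attn_q(h).unsqueeze(1) + self.attn_k(enc)))  # (B, S, 1)
            ctx = (scores.softmax(dim=1) * enc).sum(dim=1)        # (B, 2H)
            h = self.decoder(torch.cat([self.embed(tgt[:, t]), ctx], -1), h)
            logits.append(self.out(h))                            # predict next char
        return torch.stack(logits, dim=1)                         # (B, T, vocab)

Training would use cross-entropy between logits[:, :-1] and tgt[:, 1:]; at test time, teacher forcing would be replaced by greedy sampling or beam search.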

ethancaballero commented 8 years ago

Here's an input-output example from the dataset (the Description is the input and the Solution is the output; the sections labelled Input/Output/Sample Input/Sample Output are all part of the Description):

Description: For any positive integer, we define a digit rotation as either moving the first digit to the end of the number (left digit rotation), or the last digit to the front of the number (right digit rotation). For example, the number 12345 could be left digit rotated to 23451, or right digit rotated to 51234. If there are any leading zeros after digit rotation, they must be removed. So 10203 could be left digit rotated to 2031, then left digit rotated again to 312. Given an integer N, determine the largest integer that can result from performing a series of one or more digit rotations on N.

Input: Input will begin with an integer T (at most 1000), the number of test cases. Each test case consists of a positive integer N<100000000 (10^8) on a line by itself.

Output: For each test case, print the largest integer that can result from performing one or more digit rotations on N.

Sample Input:
6
12345
54321
10901
211011
7
90

Sample Output:
51234
54321
11090
211011
7
9

Solution:

# Python 2 solution, as scraped
def left_rotate(s):
    # moves the last digit to the front and strips leading zeros
    # (named "left" but this matches the problem's *right* rotation; the
    # search below tries both directions, so the final answer is unaffected)
    s = s[-1] + s[:-1]
    s = s.lstrip('0')
    return s

def right_rotate(s):
    # moves the first digit to the end and strips leading zeros
    s = s[1:] + s[0]
    s = s.lstrip('0')
    return s

t = int(raw_input())
while t:
    t -= 1
    n = raw_input()
    # seed the answer with the two mixed two-step rotations
    ans = max(int(left_rotate(right_rotate(n))), int(right_rotate(left_rotate(n))))
    temp = n[:]
    for i in range(len(n)):
        # try every count of repeated rotations in one direction...
        temp = left_rotate(temp)
        ans = max(ans, int(temp))
    temp = n[:]
    for i in range(len(n)):
        # ...and in the other
        temp = right_rotate(temp)
        ans = max(ans, int(temp))
    print ans

ethancaballero commented 8 years ago

I'm currently scraping description-solution pairs from the practice problems section on CodeChef.com; click through some of the practice problems there to see what the other pairs look like.

During the testing phase of the ML model, the Sample Input and Sample Output can be used to verify whether each solution the model generates is correct.
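
As a minimal sketch of that check (the function name and the use of subprocess are my assumptions, not part of the dataset):

import subprocess

def passes_sample(solution_path, sample_input, sample_output):
    # run a candidate Python solution on the sample input and compare
    # its stdout to the expected sample output
    result = subprocess.run(
        ["python", solution_path],
        input=sample_input, capture_output=True, text=True, timeout=10)
    return result.stdout.strip() == sample_output.strip()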

I've also tried scraping other programming-challenge websites, but have found that CodeChef is the easiest to scrape that contains a large number of description-solution pairs. The only possible downside to CodeChef is that roughly half of the challenges have a goofy two-sentence character-driven backstory in the description that might introduce undesirable noise.

Also, I was originally scraping HackerRank as well, but decided to switch to just CodeChef when I realized that all the math equations in the English versions of HackerRank's problem descriptions are rendered as SVG images.

ilyasu123 commented 8 years ago

This is a great example. The problem is definitely ambitious, so don't expect a solution to pop up in the near term. Today's supervised learning techniques are probably inadequate; real advances will be required. On the other hand, this is a good stimulus for research.

How many input-output examples of this kind do you think you'd be able to collect?

ethancaballero commented 8 years ago

Somewhere between 1000 and 5000, and I can augment the data by collecting multiple (~10 to 20?) solutions for every description.

ilyasu123 commented 8 years ago

This is a great dataset and it's worth collecting. However, it is genuinely small by modern ML standards, especially for a problem as difficult as this. Expect it to be solved only once transfer learning becomes good enough that we can train a system on lots of other data, so that it has an easy time "getting" programming from these examples. That said, I'll gladly accept the pull request (with a few tweaks) once the dataset gets into good shape.

ilyasu123 commented 8 years ago

I should set expectations: the problem will probably remain unsolved for a significant (by ML standards) amount of time.

ethancaballero commented 8 years ago

I just discovered an easy way to scrape CodeForces and HackerEarth as well, so I think I can collect ~9000 examples now.

ilyasu123 commented 8 years ago

Great!

ilyasu123 commented 8 years ago

How is data collection going?

ethancaballero commented 8 years ago

Approximately half of CodeChef (which contains ~4000 problems) is scraped as of now. I'm working on parallelizing the scraping code I'm using because it's pretty slow in its current serial form. After CodeChef is done, I have to finish scraping HackerRank, CodeForces, and HackerEarth to get to ~9000.
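
For what it's worth, here is a minimal sketch of one common way to parallelize I/O-bound scraping (the URL list and fetch logic are placeholders, not the actual scraper):

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # download one problem page; parsing it into a description-solution
    # pair would happen downstream
    return url, requests.get(url, timeout=30).text

problem_urls = []  # filled from each site's problem index (placeholder)
with ThreadPoolExecutor(max_workers=16) as pool:
    pages = dict(pool.map(fetch, problem_urls))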

Here's a link to a sample of ~500 description-solution code pairs from the codechef scrape and the code I've been using for scraping: https://github.com/ethancaballero/description2code-data-sample

The file_structure/formatting is the same as that which will be used for the finished dataset. If you see any problems with the dataset sample let me know.

ilyasu123 commented 8 years ago

The dataset looks good. Extremely ambitious. Will push online once the data is ready.

ilyasu123 commented 8 years ago

Any updates?

ethancaballero commented 8 years ago

I currently have 5300 scraped. I'm working on scraping the remaining sites. I'm also normalizing the formatting of the problem descriptions so that all problems are described in a similar format.

Here’s a link to current 5300 with a README.md describing ways to use the dataset for curriculum learning and ways to benchmark which types of algorithms one’s model is capable of learning/generating: https://docs.google.com/uc?id=0Bz3fihKG133ceWNFQTQ5S0xhZUk&export=download

ilyasu123 commented 8 years ago

Great work, thanks for collecting this data! This problem is extremely hard for today's ML, but there's no harm putting it out there :)

ilyasu123 commented 8 years ago

Also, submit a PR updating the problem description once you finish scraping the remainder of the problems.

ilyasu123 commented 8 years ago

Do you want me to link to your github username, stating that you collected the data?

ethancaballero commented 8 years ago

Linking to my GitHub username is fine. When I'm done with all the scraping, switch the username link to a link to a repo I'll post containing the scraping code (similar to how im2latex links to Miffyli's dataset-creation-tools repo).

tlbtlbtlb commented 8 years ago

The descriptions seem to lose newlines inside the example input and output, which are important in most of these problems. For example, in d2c_current/codeforces/280_C/description/description.txt (compare to http://codeforces.com/problemset/problem/280/C) the two lines "2" and "1 2" are merged into a single line "21 2".
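
If the scraper uses BeautifulSoup (an assumption on my part), one likely fix is to turn <br> tags into real newlines before extracting text:

from bs4 import BeautifulSoup

html = "<pre>2<br>1 2</pre>"  # toy stand-in for a sample-input block
soup = BeautifulSoup(html, "html.parser")
for br in soup.find_all("br"):
    br.replace_with("\n")   # preserve line structure
print(soup.get_text())      # "2\n1 2" rather than "21 2"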


ethancaballero commented 8 years ago

Ok, working on fixing newline error(s). Let me know if anyone finds any other errors or has any suggestions.

Avmb commented 7 years ago

What about generating Python functions from their docstrings? It should be easy to collect hundreds of thousands, if not millions, of examples.
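
A minimal sketch of harvesting such pairs from Python source with the standard ast module (the function and variable names here are my own; ast.get_source_segment requires Python 3.8+):

import ast

def docstring_pairs(source):
    # yield (docstring, function source) for every documented function
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                yield doc, ast.get_source_segment(source, node)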

ethancaballero commented 7 years ago

That's a cool idea. Here's a GitHub scraper & preprocessor: https://github.com/uclmr/pycodesuggest/tree/master/github-scraper https://github.com/uclmr/pycodesuggest#step-1-cloning-the-repos

We would also need to use the collected gold code as an oracle to create multiple test cases, so we can test whether (and reward the model if) the generated code satisfies the functional specification the docstring describes. To create test cases, random input arguments would be passed to the oracle, and its returned outputs would be paired with the corresponding inputs.
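
A sketch of that oracle idea (the sampler interface is an assumption for illustration):

import random

def make_test_cases(oracle, sample_args, n=20):
    # run the trusted 'gold' implementation on randomly sampled
    # arguments to produce (input, expected_output) pairs
    return [(args, oracle(*args))
            for args in (sample_args() for _ in range(n))]

# toy usage: a known-correct function serves as the oracle
cases = make_test_cases(abs, lambda: (random.randint(-100, 100),))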

An alternative would be to find URLs that already contain test cases corresponding to a piece of code and its docstring/description (similar to the way most programming-competition websites provide dozens of scrapable test cases for each problem).

Avmb commented 7 years ago

The Github scraper seems well-suited for the task.

However, I'm afraid that automatically generating sensible test cases for arbitrary code snippets, especially in a dynamic language like Python, would be very hard. It is in fact a research problem in its own right.

Maybe it is better if we initially evaluate on textual similarity (perhaps up to alpha transformations or something similar) and then move to functional similarity at a later stage.
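
A rough sketch of what "similarity up to alpha transformations" could look like for Python, using the standard tokenizer (this canonicalization scheme is my own illustration; a real metric would also whitelist builtins):

import io
import keyword
import token
import tokenize

def canonical_tokens(source):
    # rename user identifiers v0, v1, ... in order of first appearance,
    # so alpha-renamed programs produce identical token streams
    names, out = {}, []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            out.append(names.setdefault(tok.string, "v%d" % len(names)))
        else:
            out.append(tok.string)
    return out

print(canonical_tokens("x = x + 1\n") == canonical_tokens("y = y + 1\n"))  # True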

0bserver07 commented 7 years ago

@Avmb That would be neat, though as an end-to-end learning task it becomes rather hard, since docstrings are usually programmer-friendly and very brief, and a model would tend to overfit given their length.

To generalize well enough over the test cases to output working code, parsing the code into an AST rather than treating it as human-readable text can be considered the easier route.

That being said, there are some resources in this area that might be useful; here is the list:

Another neat trick, comparable to your docstring approach, is to use Pycco to generate documentation from already-written code, then learn to generate the code back:

# === Models for Todos app ===

class ListItem(models.Model):
    """
    The ListItem class defines the main storage point for todos.
    Each todo has two fields:
      text - stores the text of the todo
      is_visible - used to control if the todo is displayed on screen
    """
    text = models.CharField(max_length=300)
    is_visible = models.BooleanField()


Re: your second comment :)
Yes, that is an initial approach: first apply existing NN architectures, then move to contextual (or, in this scenario, functional) similarity through some type of reinforcement learning.
gokul-uf commented 7 years ago

@ilyasu123 This suggestion might sound like a cheat but, honestly, why not have a standardized representation of network models and training parameters that ships with a paper? This could be something like UML or XML. Then the problem reduces to writing a parser for the file format, and each framework (Keras/Theano/TensorFlow/Torch) could have its own code generator to produce the appropriate code.

I believe this would greatly reduce the time needed to recreate models published in papers, and it would also make it possible to detect whether the authors are intentionally avoiding mentioning some hyperparameters.

One could ask: what if new models come along that our markup cannot represent (like what happened with Caffe and RNNs)? That's fine; we release a new version of the markup that is backward compatible with the older versions.
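
A toy sketch of the idea (the spec schema here is invented purely for illustration): a declarative model description plus a per-framework generator.

import json

SPEC = json.loads("""
{"layers": [
    {"type": "dense", "units": 128, "activation": "relu"},
    {"type": "dense", "units": 10, "activation": "softmax"}
]}
""")

def to_keras(spec):
    # emit Keras code from the (hypothetical) spec format; other
    # frameworks would get their own generators
    lines = ["model = Sequential()"]
    for layer in spec["layers"]:
        if layer["type"] == "dense":
            lines.append('model.add(Dense(%d, activation="%s"))'
                         % (layer["units"], layer["activation"]))
    return "\n".join(lines)

print(to_keras(SPEC))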

ethancaballero commented 7 years ago

https://github.com/EdinburghNLP/code-docstring-corpus

Jeremy-sense commented 2 years ago

This dataset is really helpful.

1) Is there an update for the link https://docs.google.com/uc?id=0Bz3fihKG133ceWNFQTQ5S0xhZUk&export=download ? That link no longer works.
2) Are all the included solutions correct? (I am guessing not, based on the discussion of unit tests.) I am interested in finding sets that contain mistakes, so we can train a model to recognize (common) mistakes and help students.

[If this should be filed as an issue rather than a comment here, I am sorry and happy to move it.] Thanks.