Closed: whedon closed this issue 3 years ago
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @lutzhamel, @yuhc it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably watching this repository, which means that, due to GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Software report (experimental):
github.com/AlDanial/cloc v 1.88 T=0.14 s (755.4 files/s, 55795.4 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 80 1346 173 4098
Markdown 8 215 0 919
JSON 4 0 0 901
YAML 7 13 2 130
TeX 1 9 0 81
Bourne Shell 4 14 13 43
DOS Batch 1 8 1 26
Dockerfile 2 8 0 17
TOML 1 2 0 12
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 109 1619 196 6236
-------------------------------------------------------------------------------
Statistical information for the repository '768d7180239c666227fd7a05' was
gathered on 2021/05/20.
The following historical commit information, by author, was found:
Author Commits Insertions Deletions % of changes
Batuhan Taskaya 177 11703 6086 100.00
Below are the number of rows from each author that have survived and are still
intact in the current revision:
Author Rows Stability Age % in comments
Batuhan Taskaya 5617 48.0 3.6 2.99
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/j.scico.2012.04.008 is OK
- 10.5281/zenodo.4657163 is OK
- 10.1007/s10664-017-9514-4 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @mjsottile, I've tried to run the software within the latest docker on Ubuntu 20.04, but it requires a subtle change and doesn't return search results as expected. Shall I contact the author to resolve this issue? (The document doesn't state enough about the environment settings.)
Please let me know on the bug tracker about any problems you experience building / running Reiz, @yuhc. Thanks!
Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change docker-compose's version from 3.9 to 3.8 to build and start the container.
The build instruction is usually expected to contain the system environment.
I didn't see `API is running on ...` on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search `Call(Name("len"))`. However, the loading icon never disappeared, and no results were ever returned.
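For readers unfamiliar with the query syntax, `Call(Name("len"))` describes call expressions whose callee is the bare name `len`. A rough plain-Python approximation using the stdlib `ast` module (this is just an illustration of what the query matches, not Reiz's engine) would be:

```python
import ast


def find_len_calls(source: str) -> list[int]:
    """Return line numbers of calls whose callee is the bare name `len`.

    A plain-Python approximation of the ReizQL query Call(Name("len")),
    using only the stdlib ast module -- not Reiz's actual engine.
    """
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "len"
    ]
```

Running this over a file should surface the same call sites the search UI is expected to return.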
Could you explain more about what the dataset is built on top of ("~75 files from 10 different projects") and help check what's going wrong here? Also, it'll be great to know the use cases of reiz.io.
> Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change docker-compose's version from 3.9 to 3.8 to build and start the container.
Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).
> I didn't see `API is running on ...` on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search `Call(Name("len"))`.
You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (It would be better if you could create an issue on the repo itself!)
Thanks!
> Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).
Created https://github.com/reizio/reiz.io/issues/51.
> You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (It would be better if you could create an issue on the repo itself!)
Hi @isidentical, perhaps I missed it, but I don't see any instructions in the repo on how to run this software. Could you point me to the spot where it tells me how to install and run the software?
Thanks.
@isidentical, never mind just found it under the docs link....
@isidentical, docker-compose does not run on the given files...
ubuntu@ip-172-31-94-52:~$ git clone https://github.com/reizio/reiz.io
Cloning into 'reiz.io'...
remote: Enumerating objects: 1884, done.
remote: Counting objects: 100% (362/362), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 1884 (delta 206), reused 264 (delta 132), pack-reused 1522
Receiving objects: 100% (1884/1884), 586.58 KiB | 24.44 MiB/s, done.
Resolving deltas: 100% (1098/1098), done.
ubuntu@ip-172-31-94-52:~$ ls
reiz.io
ubuntu@ip-172-31-94-52:~$ cd reiz.io/
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose up --build --remove-orphans
ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
ubuntu@ip-172-31-94-52:~/reiz.io$
What docker version are you using, @lutzhamel? Please ensure you are using a newer one, something like `19.03.13`.
@isidentical, here is what I am using:
ubuntu@ip-172-31-94-52:~/reiz.io$ docker --version
Docker version 20.10.6, build 370c289
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose --version
docker-compose version 1.25.0, build unknown
ubuntu@ip-172-31-94-52:~/reiz.io$
Hi @isidentical, I just uploaded a screenshot of the 8000 page to https://github.com/reizio/reiz.io/issues/52; I hope it is helpful for your debugging. Separately, I don't think the review is going in the right direction, and I should state that clearly in case you aren't aware:
Thanks for your comment @yuhc! It is a bit of a black-box situation for me too, since everything is reproducible in my environment. I'll try to get everything set up on an AWS machine this weekend and let you both know the exact environment. Thanks for your patience.
@isidentical, have you considered setting up a working instance on a web based virtual machine like https://replit.com/ eliminating set up issues all together?
@yuhc, thanks for your comments. I agree, as reviewers we should not be wrangling software, we should be just verifying that it works as advertised. I too will be waiting for a working instance before continuing the review.
> @isidentical, have you considered setting up a working instance on a web based virtual machine like https://replit.com/ eliminating set up issues all together?
That is a great idea! I'll create an open instance as well as instructions on a clean AWS machine. Sorry for all the inconvenience I caused!
@yuhc @lutzhamel I've deployed a public instance at https://reiz.io. Unfortunately, I wasn't able to give exact instructions for an AWS instance since they still haven't approved my personal account. Though, as you stated, if either of you wants access to the environment the server is running in, I can provide it (it is running on a DigitalOcean VPS right now). Please let me know whether this method works for you, thanks!
:wave: @yuhc, please update us on how your review is going (this is an automated reminder).
:wave: @lutzhamel, please update us on how your review is going (this is an automated reminder).
Thanks for setting up reiz.io. It works for me. Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.
BTW, registering an AWS account should just take a few minutes. I guess something went wrong, and you may need to contact the support.
> Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.
Sure! I think `docker-compose` should just work in any case; the errors should be gone by now. As far as I can tell, it was a browser issue, compounded by the confusion that the API and the website are served on different ports, and I've also made a couple of changes to support port forwarding over SSH. I'll double-check it, though. As far as reproducibility goes, I'll also include a document for installing it in deploy mode (Docker creates a toy environment with limited data, whereas the deploy installation indexes ~500 projects instead of 10).
My AWS account has also now been approved, so here are the exact steps to use reiz with only `docker`/`docker-compose` and SSH: https://reizio.readthedocs.io/en/latest/installation-to-aws.html
Hi @isidentical, I took a look at the updated README, and it looks good in general. There are some nits:
- The `git fetch` and `git reset --hard` lines can be removed.
- Use `instance_ip:8000` instead of `localhost:8000`. Or you can remove anything related to AWS and change the instructions to ones for a local Ubuntu 18.04 machine.

Hi @mjsottile, the article proof link is broken. Could you help fix it?
While waiting for the article proof, I have a question for you, @isidentical: the software is about code search at scale, but the database setup seems to take a long time, and there's no experiment showing the query latency for scaled datasets. Is there any evidence for the claim that the search system works "at scale"? (Something like algorithms or a complexity analysis would be fine, just for reference.)
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
> but the database setup seems to take a long time
Yes, indeed. Reiz goes through multiple phases while aggregating data into the database, from downloading the projects to generating a one-time node tag (a fingerprint of each node). Since all of this has to be done in the setup/indexing part, neither users nor the running system pays that cost after installation (during the actual run).
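To illustrate what a one-time node tag buys (this sketch is hypothetical, not Reiz's actual algorithm), a structural fingerprint can be derived by hashing a position-free dump of the node, so that structurally identical fragments share a tag regardless of where they appear:

```python
import ast
import hashlib


def node_tag(node: ast.AST) -> str:
    """Hypothetical sketch of a structural fingerprint for an AST node.

    ast.dump() omits line/column attributes by default, so the hash
    covers only node types and field values; structurally identical
    code fragments therefore get identical tags. Illustrative only,
    not Reiz's actual node-tag implementation.
    """
    dump = ast.dump(node, annotate_fields=True)
    return hashlib.sha1(dump.encode()).hexdigest()[:12]
```

Because the tag depends only on structure, it can be computed once at indexing time and reused at query time without re-walking the source.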
> there's no experiment showing the query latency for scaled datasets.
I'm not sure where to put it (the paper or the software docs), but yes, I can definitely index data at a real-world level (50k+ unique files from ~1,000 projects) and let you know about the latency (which would probably be measured directly against the server API, so it isn't affected by other factors). I have a few examples where Reiz is used in the development of CPython (to collect data, e.g. whether to deprecate a usage, or checking how many arguments have annotations split across lines for a new feature), and I intend to use those queries in the benchmarks (let me know if you have any extra queries you want me to add to that list). It would be a bit hard to compare it with other providers (just as we did in the evaluation), though it might help validate the "at scale" point.
> (Something like algorithms or a complexity analysis would be fine, just for reference.)
The "at scale" term originated from the fact that other tools that tried to do the same thing (search source code directly on the AST, but not in the manner of a "search engine") performed worse than Reiz when run on a large dataset (50k files, for example).
There is a small internal-use utility under Reiz called `./scripts/debug_query.py`, which prints out the AST and IR for a given query (it requires a full-mode installation, with the configuration set). It is helpful when you want to see the complexity of a single query. Here is a simple example, and here is the generated IR of the most complex query that I have. The complexity increases with the number of "checks" you are performing. I don't think it would make sense to express this in big-O terms, though the implementation tries to stay as linear as possible. It used to create a new block every time you used one of the advanced matchers, but now they are all flattened out thanks to https://github.com/reizio/reiz.io/pull/12.
Seems there are some misunderstandings of "at scale". Reiz is a good tool, and I just wanted to know whether its title is a bit overstated.
> This spans from the actual downloading of the projects to generate a one-time node-tag
That's the concern. How long does it take to generate a dataset for 1k projects? 100k? 1M? 10M? For 50k files? 500k? 5M? 50M? 50k files and 1k projects are small scale, to be honest. In most cases, generation/updates also have to be done online for a search engine, but that is out of scope.
> I'm not sure where to put it (the paper or the software docs)
Most framework- or system-level software has performance analysis reports in its documentation (for example, a "Performance" page in https://reizio.readthedocs.io/en/latest/reizql.html). I don't doubt its importance, but I won't force you to measure performance at this moment, as reiz is not yet a production-ready system. I am pretty sure you will need it in the future.
> but yes, I can definitely index data at a real-world level (50k+ unique files from ~1,000 projects) and let you know about the latency (which would probably be measured directly against the server API, so it isn't affected by other factors).
As I said, this data scale is not really impressive. And please remember, reiz will not receive only a single request at a time: thousands or tens of thousands of requests could be sent simultaneously. I will not call a system "at scale" if it cannot prove its ability to handle datasets and request volumes of a significant size.
Why I asked for latency: if there were measurements for 50k, 500k, and 5M files with 100, 1k, and 10k requests, and they showed similar latencies, reiz would then have proven that it is scalable.
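For concreteness, the kind of measurement being asked for could be scripted with a small harness like the one below, where `run_query` stands in for an actual request against a deployed reiz instance (the names here are illustrative, not part of reiz):

```python
import statistics
import time


def measure_latency(run_query, repeats: int = 5) -> dict:
    """Time a query callable over several runs and summarize the results.

    `run_query` is a placeholder: in a real benchmark it would issue an
    HTTP request against the reiz API for one of the test queries, at
    each dataset size / concurrency level being evaluated.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Repeating this across the 50k / 500k / 5M-file datasets and comparing the medians is the scalability evidence being requested.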
> I have a few examples where Reiz is used in the development of CPython
It is not persuasive unless there are numbers provided. (Update: sorry that I misread the sentence initially.)
> The "at scale" term originated from the fact that other tools that tried to do the same thing (search source code directly on the AST, but not in the manner of a "search engine") performed worse than Reiz when run on a large dataset (50k files, for example).
That is "efficiently" instead of "at scale". There is a simple question that you may think about: assume reiz can support indexing C/C++/Java files, could it run on a Linux kernel dataset efficiently? Chromium? AOSP? Again, without any perf numbers, a potential user of reiz would only try reiz with small-sized personal projects. "At scale" is meaningless in such a case.
Fair points! I hadn't been thinking at those scales, and my comparison was with a limited set of tools. I'll try to update the docs with benchmarks on the data that I can easily collect through the already-integrated sampling process, and also drop the "at scale" tag at the end. Thanks!
Hello:
I have some comments on the current state of the manuscript/software:
- Finally, why not incorporate regular expressions into `match_string_pattern`?

@mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.
Hello @lutzhamel!
> The documentation does not explain how to set up the software package for a user specific project, GitHub or otherwise. It only demonstrates the functionality on some prepackaged set of projects.
Yes, that was intentional. Even though it is very trivial to do, and possible (any package hosted on some sort of remote git repo can be used, through https://github.com/reizio/reiz.io/blob/cff3cc6eaad532ac1a956c1f7c7a58d97ea00e4b/reiz/sampling/data.py#L8-L19, which corresponds to the `dataset.json` file entries), I'd avoided it since I wanted to keep Reiz a general-purpose search engine. Reiz also employs a pluggable-parts design, so the implementors (in this case reiz.io) provide the sampling strategies. The default strategy fetches the most-downloaded packages from Python's package index, PyPI, though there have been others (even if not turned on by default), such as fetching the most popular projects from GitHub by either the number of stars or the last update.
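The pluggable-parts idea can be pictured with a sketch like the one below. The names `Project`, `SamplingStrategy`, and `StaticList` are illustrative stand-ins, not Reiz's actual API; the real entry format lives in `reiz/sampling/data.py`:

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol


@dataclass
class Project:
    # Illustrative stand-in for a dataset.json entry: a name plus the
    # remote git URL the source will be fetched from.
    name: str
    git_url: str


class SamplingStrategy(Protocol):
    # Implementors (e.g. reiz.io) plug in their own source of projects:
    # top PyPI downloads, GitHub stars, a hand-written list, etc.
    def sample(self, limit: int) -> Iterator[Project]: ...


class StaticList:
    """Trivial strategy: yield projects from a fixed list."""

    def __init__(self, projects: List[Project]) -> None:
        self.projects = projects

    def sample(self, limit: int) -> Iterator[Project]:
        yield from self.projects[:limit]
```

Pointing the engine at a user-specific repository then amounts to supplying a strategy (or list entry) for it, rather than changing the engine itself.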
> The query language feels rough and the syntax is unwieldy, creating a huge barrier of entry to the package. Basically the user is asked to construct AST snippets in the precise format that CPython specifies that act as templates against the DB. I am wondering if query languages for graph databases can give some inspiration for an easier query syntax.
I'd have to disagree with you on this point. The language is entirely independent of CPython's (or any other implementation's) AST, as described in the Grammar appendix. When you implement a variant for Reiz, you feed it an ASDL file, which then creates those "matchers". In the reference implementation I proposed, the ASDL file originated from CPython (see this). If you were to implement another language, it would be totally different.
However, I agree that ReizQL is a low-level language for querying source code, and that is also intentional. I even wrote a high-level interface for Python called irun. The following query is written in a Python superset, where you search for code that looks exactly the same with some expanded fragments; it is then compiled to ReizQL.
Query:

```
with open(...) as $stream:
    tree = ast.parse($stream.read())
```
ReizQL version:

```
With(
    items=[
        withitem(
            context_expr=Call(func=Name(id="open"), args=[...], keywords=[]),
            optional_vars=~stream,
        )
    ],
    body=[
        Assign(
            targets=[Name(id="tree")],
            value=Call(
                func=Attribute(value=Name(id="ast"), attr="parse"),
                args=[
                    Call(
                        func=Attribute(value=~stream, attr="read"), args=[], keywords=[]
                    )
                ],
                keywords=[],
            ),
        )
    ],
)
```
As you said, writing the second form by hand would create a really high barrier to entry. Though, as I tried to explain, Reiz is just the execution engine; the superset is merely syntactic sugar on top of it.
> Finally, why not incorporate regular expressions into the match_string_pattern?
When you want to match names (e.g. finding all tests written in pytest/unittest fashion, `FunctionDef(f'test_%')`), you often search for prefixes or suffixes. Of course, I don't deny the occasional need for something more complex, but in terms of execution cost that would be very expensive compared to a stricter but still powerful alternative (the LIKE syntax in SQL). Since the underlying engine has native support for regexes, it would be fairly easy to implement this one day when the need arises, though some implementations might need to validate / check the complexity of input regexes; otherwise it might perform really badly, since we don't have trigram indexes, unlike other engines that support this.
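To make the trade-off concrete, here is a hedged sketch (not Reiz's implementation) of how a simplified LIKE-style pattern such as `test_%` can be compiled into an anchored regex. Only `%` is treated as a wildcard here; `_` is matched literally, matching how it is used in the `FunctionDef(f'test_%')` example (in real SQL, `_` matches a single character):

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern":
    """Compile a simplified LIKE pattern into an anchored regex.

    Only '%' is treated as a wildcard (matching any run of characters);
    everything else, including '_', is matched literally. Illustrative
    sketch only -- not Reiz's actual matcher.
    """
    parts = (re.escape(chunk) for chunk in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$")


def like_match(pattern: str, text: str) -> bool:
    return like_to_regex(pattern).match(text) is not None
```

Because the pattern is anchored and its structure is known in advance, this class of match stays cheap, which is the cost argument made above against arbitrary user-supplied regexes.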
> @mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.
Even though that saddens me, I do understand the source of your worries. Thank you both for your reviews and suggestions.
Hi @isidentical, I think @lutzhamel made a really good point. When you advertise your software, you need to document its features and probably also limitations well. Any undocumented feature is assumed not to exist, while anything implied from the context but undocumented in the limitations is supposed to be supported.
From the article proof and repo, we expect to see the instructions and examples for an efficient (scalable, previously), user-friendly code search on a user-specific project. Your document should demonstrate these points or admit some shortcomings. I've talked about the efficiency for which no data is provided. @lutzhamel mentioned user-friendliness and generality of the software, and I'd like to put a question onto your plate more explicitly:
I'd also like to mention that AST-based code search and programming, or more generally code-to-code code search and programming, have been researched for decades. There are plenty of papers that you can find and read to improve Reiz and its documentation.
Hi all - I’ve been a bit buried in day job activities so I’ve been watching the discussion from the sidelines. Thank you for the detailed discussion.
Regarding the review itself:
Comments to @isidentical [note: I'm sharing these comments for your benefit - they may end up ultimately being useful if you choose to write a full paper for a regular research paper venue]:
With respect to some technical details:
Any comments/thoughts/questions?
It's fine with me to accept Reiz with some minor changes, mainly in its article proof and documentation:
As @mjsottile mentioned, the comparison between string search and tree search, and the use of DSLs for code query, have been well researched. @isidentical, I recommend you investigate those works more deeply and come up with a nicer README for Reiz. For example, you compared Reiz with GitHub code search, which is reasonable at a glance. But if you look closer at GitHub code search, you'd find it's based on ElasticSearch, on which one could reimplement a scalable, efficient, extensive "ElasticReiz.io" with trivial effort.
Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore, and alternatives to Reiz appeared many years ago. However, I'm interested in seeing Reiz grow into a more mature project like https://github.com/jonatas/fast and https://github.com/takluyver/astsearch. It can be a useful tool/library for Python once it proves its efficiency compared with existing libraries. That's why I talked about efficiency so much, and probably why @lutzhamel talked about its programmability. The opportunity I see in Reiz is to make it more user-friendly and as fast as possible.
To add to @yuhc's comments: I am not opposed to publishing this paper as long as the following is fulfilled:
Both of these points are not made clear in the current paper and therefore were the source of my confusion.
At this point I believe I have what I need to know from the reviewers @yuhc and @lutzhamel. I need to think a bit since over the span of a few days the recommendation swung from not acceptable to acceptable. Two comments though that I believe must be addressed:
> Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore
This is not supported by activity in the literature. I would refrain from dismissing an area of research as part of a review - it is often unproductive and rarely holds up under scrutiny. A cursory search for recent publications in that relatively broad area yields a steady number of papers up to and including this year. AST-based code search is still active in areas including but not limited to: mining code repositories, clone detection, defect detection, and is a component of some software synthesis tools. AST-based code manipulation is certainly active - almost all code transformation methods are based on either direct manipulation of the AST or operations on AST-derived structures.
Similarly:
> But if you look closer at GitHub code search, you'd find it's based on ElasticSearch, on which one could reimplement a scalable, efficient, extensive "ElasticReiz.io" with trivial effort.
It is also generally not good form to dismiss work in this way. It is one thing to refer to another tool as potentially having high overlap to encourage an author to discuss what is different or to state that it is basically the same, but implemented using different methods. Casually dismissing it as trivial is not very productive.
Thanks again for the reviews and I'll figure out next steps once I've heard thoughts from @isidentical regarding how they would like to proceed.
Thank you @mjsottile for the comments. Yes, I brought those up to give some suggestions for directions of improvement, but I didn't mean to dismiss the work or the field. I totally agree with what you said, and I think we're talking about different scopes of "academic value".
AST-based code search was actively explored in academia many years ago, and I saw quite a few amazing papers while I was waiting for Reiz's bug fix. The field isn't that active now, not because it's unimportant (the techniques have been used in many places) but because it has been well explored. More nice work and findings are of course likely to appear, but one should think about it carefully before diving in to produce new "academic value".
Reiz is an impressive project, and I believe it can be useful in some cases. (I personally really admire @isidentical for writing this software alone at his age.) The ElasticSearch example was meant to convey that it's important to look around and learn from existing works to avoid overstating any features or novelty. Sorry for the confusion.
@mjsottile, yes, my review went from unacceptable to a possible accept once I understood what I was looking at. In my view, the paper/software would not have held up as an "end user" product, but as an engine it is acceptable, provided that the paper addresses the points I made in an earlier comment.
Thanks @yuhc @mjsottile @lutzhamel for your comments! I'm not sure whether you still want to proceed, but looking at the last comments I was able to collect a list of things to change in the documentation/paper itself, and went ahead with them (https://github.com/reizio/reiz.io/commit/64ecc9b754d3698a328d984fc0f08bf5db21a5f9).
(I don't know how to change the title of this issue, though I dropped the "at Scale" part from the paper itself.)
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
OK. That looks pretty good to me. One major sticking point, and a point that originally got me confused about what I was looking at, remains: the section "Sampling Source Code". It is written as if the engine can only be used in this context. It should be rewritten to make explicit that this is the experiment used to validate the engine; the user is free to index any source code project they want.
With this rewrite I'd be happy to suggest the paper for publication.
-L
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Looks good to me. From my perspective I am happy to recommend the paper for publication at this point.
The only missing piece is that the article doesn't mention existing AST-based search engines or libraries in "State of the field". It looks good to me to move forward after the relevant citations are added.
@whedon generate pdf
Submitting author: @isidentical (Batuhan Taskaya)
Repository: https://github.com/reizio/reiz.io
Version: v1.0.0
Editor: @mjsottile
Reviewers: @lutzhamel, @yuhc
Archive: 10.5281/zenodo.5029255
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@lutzhamel & @yuhc, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mjsottile know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @lutzhamel
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @yuhc
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper