
[REVIEW]: Reiz: Structural Source Code Search at Scale #3296

Closed whedon closed 3 years ago

whedon commented 3 years ago

Submitting author: @isidentical (Batuhan Taskaya)
Repository: https://github.com/reizio/reiz.io
Version: v1.0.0
Editor: @mjsottile
Reviewers: @lutzhamel, @yuhc
Archive: 10.5281/zenodo.5029255

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/f56ffa0b5dde25caf549a0f4a4d05604"><img src="https://joss.theoj.org/papers/f56ffa0b5dde25caf549a0f4a4d05604/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/f56ffa0b5dde25caf549a0f4a4d05604/status.svg)](https://joss.theoj.org/papers/f56ffa0b5dde25caf549a0f4a4d05604)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@lutzhamel & @yuhc, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mjsottile know.

Please start your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Review checklist for @lutzhamel

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @yuhc

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @lutzhamel, @yuhc it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' at https://github.com/openjournals/joss-reviews:


  2. You may also like to change your default notification settings for watched repositories in your GitHub profile: https://github.com/settings/notifications


For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago
Software report (experimental):

```
github.com/AlDanial/cloc v 1.88  T=0.14 s (755.4 files/s, 55795.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          80           1346            173           4098
Markdown                         8            215              0            919
JSON                             4              0              0            901
YAML                             7             13              2            130
TeX                              1              9              0             81
Bourne Shell                     4             14             13             43
DOS Batch                        1              8              1             26
Dockerfile                       2              8              0             17
TOML                             1              2              0             12
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                           109           1619            196           6236
-------------------------------------------------------------------------------

Statistical information for the repository '768d7180239c666227fd7a05' was
gathered on 2021/05/20.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Batuhan Taskaya                177         11703           6086          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Batuhan Taskaya            5617           48.0          3.6                2.99
```

whedon commented 3 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1016/j.scico.2012.04.008 is OK
- 10.5281/zenodo.4657163 is OK
- 10.1007/s10664-017-9514-4 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

yuhc commented 3 years ago

Hi @mjsottile, I've tried to run the software with the latest Docker on Ubuntu 20.04, but it required a small change to build and still doesn't return search results as expected. Shall I contact the author to resolve this issue? (The documentation doesn't say enough about the required environment.)

isidentical commented 3 years ago

Please report any problems you experience building/running Reiz on the bug tracker @yuhc. Thanks!

yuhc commented 3 years ago

Please report any problems you experience building/running Reiz on the bug tracker @yuhc. Thanks!

Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change the version key in docker-compose.yml from 3.9 to 3.8 to build and start the container.

Build instructions are usually expected to state the supported system environment.

I didn't see API is running on ... on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search Call(Name("len")). However, the loading icon never disappeared, and no results were ever returned.

Could you explain more about what the dataset is built on top of ("~75 files from 10 different projects") and help check what's going wrong here? Also, it would be great to know the use cases of reiz.io.
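
(For context, ReizQL patterns such as Call(Name("len")) mirror the node shapes of CPython's abstract grammar, so the standard library alone can show the shape such a query targets. This is just Python's ast module, not Reiz itself.)

```python
import ast

# Parse a call to len() and dump its AST; the structure below is what
# the query pattern Call(Name("len")) is meant to match.
expr = ast.parse("len(x)", mode="eval").body
print(ast.dump(expr))
# Call(func=Name(id='len', ctx=Load()), args=[Name(id='x', ctx=Load())],
#      keywords=[])
```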

isidentical commented 3 years ago

Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change the version key in docker-compose.yml from 3.9 to 3.8 to build and start the container.

Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).

I didn't see API is running on ... on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search Call(Name("len"))

You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (By the way, it would be better if you could create an issue on the repo itself!)

Thanks!

yuhc commented 3 years ago

Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).

Created https://github.com/reizio/reiz.io/issues/51.

You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (By the way, it would be better if you could create an issue on the repo itself!)

Created https://github.com/reizio/reiz.io/issues/52.

lutzhamel commented 3 years ago

Hi @isidentical, perhaps I missed it, but I don't see any instructions in the repo on how to run this software. Could you point me to the spot where it tells me how to install and run the software?

Thanks.

lutzhamel commented 3 years ago

@isidentical, never mind, just found it under the docs link...

lutzhamel commented 3 years ago

@isidentical, docker-compose does not run with the given files...

```
ubuntu@ip-172-31-94-52:~$ git clone https://github.com/reizio/reiz.io
Cloning into 'reiz.io'...
remote: Enumerating objects: 1884, done.
remote: Counting objects: 100% (362/362), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 1884 (delta 206), reused 264 (delta 132), pack-reused 1522
Receiving objects: 100% (1884/1884), 586.58 KiB | 24.44 MiB/s, done.
Resolving deltas: 100% (1098/1098), done.
ubuntu@ip-172-31-94-52:~$ ls
reiz.io
ubuntu@ip-172-31-94-52:~$ cd reiz.io/
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose up --build --remove-orphans
ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
ubuntu@ip-172-31-94-52:~/reiz.io$
```

isidentical commented 3 years ago

What Docker version are you using, @lutzhamel? Please make sure you are using a recent one, something like 19.03.13.

lutzhamel commented 3 years ago

@isidentical, here is what I am using:

```
ubuntu@ip-172-31-94-52:~/reiz.io$ docker --version
Docker version 20.10.6, build 370c289
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose --version
docker-compose version 1.25.0, build unknown
ubuntu@ip-172-31-94-52:~/reiz.io$
```

yuhc commented 3 years ago

Hi @isidentical, I just uploaded a screenshot of the page on port 8000 to https://github.com/reizio/reiz.io/issues/52; I hope it is helpful for your debugging. I don't think the review is going in the right direction, though, and I should state a few things clearly in case you are not aware:

  1. The repo, website, or article proof should describe the running environment of your software. You should either make the code work across multiple environments, or state the supported environments clearly.
    1. I'm using an AWS c4.4xlarge instance, Ubuntu 20.04, and the latest packages available on 20.04 by default. I don't know what setup @lutzhamel is using, but the expected testing environment should be provided by you.
    2. Otherwise, you should get a c4 instance and fix your code to work there. You can rent a c4.xlarge for a few hours and it will only cost you a buck.
  2. The reviewers aren't responsible for helping with debugging.
    1. This is a basic criterion for any peer-reviewed journal or conference, and I believe it also applies to JOSS.
    2. From now on I'll wait for the setup results from @lutzhamel and updated setup instructions from you; otherwise the process becomes a Pandora's box for the reviewers.
isidentical commented 3 years ago

Thanks for your comment @yuhc! It is a bit of a black-box situation for me too, since everything is simply reproducible in my environment. I'll try to get everything set up on an AWS machine this weekend and let you both know the exact environment. Thanks for your patience.

lutzhamel commented 3 years ago

@isidentical, have you considered setting up a working instance on a web-based virtual machine like https://replit.com/, eliminating setup issues altogether?

@yuhc, thanks for your comments. I agree: as reviewers we should not be wrangling software; we should just be verifying that it works as advertised. I too will be waiting for a working instance before continuing the review.

isidentical commented 3 years ago

@isidentical, have you considered setting up a working instance on a web-based virtual machine like https://replit.com/, eliminating setup issues altogether?

That is a great idea! I'll create an open instance, as well as instructions for a clean AWS machine. Sorry for all the inconvenience I caused!

isidentical commented 3 years ago

@yuhc @lutzhamel I've deployed a public instance at https://reiz.io. Unfortunately, I wasn't able to give exact instructions for an AWS instance, since they still haven't approved my personal account. As you suggested, though, if either of you wants access to the environment the server is running in, I can provide it (it is running on a DigitalOcean VPS right now). Please let me know whether this method works for you. Thanks!

whedon commented 3 years ago

:wave: @yuhc, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @lutzhamel, please update us on how your review is going (this is an automated reminder).

yuhc commented 3 years ago

Thanks for setting up reiz.io; it works for me. Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.

BTW, registering an AWS account should only take a few minutes. I guess something went wrong, and you may need to contact support.

isidentical commented 3 years ago

Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.

Sure! I think the docker-compose setup should just work in any case, and the earlier errors should be gone by now. (As far as I can tell it was a browser issue, compounded by the confusion of the API being served on one port and the website on another; I've also made a couple of changes to support port forwarding over SSH.) I'll double-check it. As far as reproducibility goes, I'll also include a document for installing it in deploy mode (Docker creates a toy environment with limited data, whereas the deploy-mode installation indexes ~500 projects rather than 10).

isidentical commented 3 years ago

My AWS account has now been approved too, so here are the exact steps to use Reiz with only docker/docker-compose and SSH: https://reizio.readthedocs.io/en/latest/installation-to-aws.html

yuhc commented 3 years ago

Hi @isidentical, I took a look at the updated README, and it looks good in general. There are some nits:

yuhc commented 3 years ago

Hi @mjsottile, the article proof link is broken. Could you help fix it?

yuhc commented 3 years ago

While waiting for the article proof, I have a question for you, @isidentical: the software is about code search at scale, but the database setup seems to take a long time and there's no experiment showing the query latency for scaled datasets. Is there any evidence supporting the claim that the search system works "at scale"? (Something like algorithms or a complexity analysis would be fine. Just for reference.)

mjsottile commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

isidentical commented 3 years ago

but the database setup seems to take a long time

Yes, indeed. Reiz runs multiple phases while aggregating data into the database, spanning from the actual downloading of the projects to generating a one-time node-tag (a fingerprint of each node). Since all of this happens in the setup/indexing phase, neither users nor the running system pays that cost after installation (during actual queries).
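
Reiz's actual node-tag format isn't shown in this thread; the following is a minimal sketch of the general idea, a structural fingerprint computed once per node at indexing time, with the hashing scheme invented here purely for illustration:

```python
import ast
import hashlib

def node_tag(node: ast.AST) -> str:
    # Hypothetical stand-in for Reiz's one-time node-tag: hash the node
    # type together with the tags of its children, so structurally
    # identical subtrees get identical fingerprints.
    parts = [type(node).__name__]
    parts.extend(node_tag(child) for child in ast.iter_child_nodes(node))
    return hashlib.sha1("|".join(parts).encode()).hexdigest()[:12]

tree = ast.parse("tree = ast.parse(stream.read())")
print(node_tag(tree))  # computed once at indexing time, never per query
```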

there's no experiment showing the query latency for scaled datasets.

I'm not sure where to put it (the paper or the software docs), but yes, I can definitely index data at a real-world level (50k+ unique files from ~1000 projects) and report the latency (measured directly against the server's API, so it isn't affected by other factors). I have a few examples where Reiz was used in the development of CPython (to collect data, e.g., whether to deprecate a usage, or how many arguments have annotations split across multiple lines, for a new feature), and I intend to use those queries in the benchmarks (let me know if you have any extra queries you want me to add to that list). It would be a bit hard to compare it with other providers (as we did in the evaluation), though it might help to validate the 'at scale' point.
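
A first-cut latency measurement along those lines could be as simple as timing round trips against the running API; the endpoint path and payload shape below are assumptions for illustration, not Reiz's documented interface:

```python
import statistics
import time

import requests  # third-party: pip install requests

URL = "http://localhost:8000/query"  # assumed API endpoint
QUERY = 'Call(Name("len"))'

latencies = []
for _ in range(50):
    start = time.perf_counter()
    requests.post(URL, json={"query": QUERY}, timeout=30)
    latencies.append(time.perf_counter() - start)

print(f"median latency: {statistics.median(latencies) * 1000:.1f} ms")
```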

(Something like algorithms or a complexity analysis would be fine. Just for reference.)

The "at scale" term originated from the fact that other tools which tried to do the same thing (searching source code directly on the AST, though not in the manner of a 'search engine') performed worse than Reiz when run on a giant dataset (50k files, for example).

There is a small internal-use utility under Reiz, ./scripts/debug_query.py, which prints the AST and IR for a given query (it requires Reiz to be installed in full mode, with the configuration set). It is helpful when you want to see the complexity of a single query. Here is a simple example, and here is the generated IR of the most complex query I have. The complexity grows with the number of 'checks' you are performing. I don't think it would make sense to express this in big-O terms, though the implementation tries to stay as close to linear as possible. It used to create a new block every time you used certain advanced matchers, but they are all flattened out now thanks to https://github.com/reizio/reiz.io/pull/12.
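
Since the ReizQL examples in this thread are all syntactically valid Python expressions, the first half of what ./scripts/debug_query.py prints (the query's own AST) can be previewed with the standard library alone; the generated IR still requires a full-mode installation:

```python
import ast

# A ReizQL query happens to parse as a Python expression, so its own
# AST can be inspected directly (indent= requires Python 3.9+).
query = 'Call(func=Name(id="open"), args=[...], keywords=[])'
print(ast.dump(ast.parse(query, mode="eval").body, indent=2))
```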

yuhc commented 3 years ago

It seems there is some misunderstanding of "at scale". Reiz is a good tool, and I just wanted to know whether its title is a bit overstated.

spanning from the actual downloading of the projects to generating a one-time node-tag

That's the concern. How long does it take to generate a dataset for 1k projects? 100k? 1M? 10M? For 50k files? 500k? 5M? 50M? 50k files and 1k projects are small-scale, to be honest. In most cases the generation/update also has to happen online for a search engine, but that is out of scope here.

I'm not sure where to put it (the paper or the software docs)

Most framework- or system-level software includes performance analysis reports in its documentation (for example, a "Performance" page in https://reizio.readthedocs.io/en/latest/reizql.html). I don't doubt its importance, but I won't force you to measure performance at this moment, as Reiz is not yet a production-ready system. I am pretty sure you will need it in the future.

but yes, I can definitely index data at a real-world level (50k+ unique files from ~1000 projects) and report the latency (measured directly against the server's API, so it isn't affected by other factors).

As I said, that data scale is not really impressive. And please remember, there is never only a single access to Reiz: thousands or tens of thousands of requests could arrive simultaneously. I will not call a system "at scale" if it cannot prove its ability to handle datasets and request volumes of significant size.

Why I asked for latency: if there are measurements for 50k, 500k, and 5M files under 100, 1k, and 10k requests, and they show similar latencies, then Reiz proves that it is scalable.
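
A harness for the request-volume half of that matrix could be sketched as below; the endpoint path and payload are assumptions for illustration, and the file-count axis would come from reindexing with progressively larger datasets:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party: pip install requests

URL = "http://localhost:8000/query"  # assumed API endpoint

def timed_request(_):
    # Time one round trip for a fixed query.
    start = time.perf_counter()
    requests.post(URL, json={"query": 'Call(Name("len"))'}, timeout=60)
    return time.perf_counter() - start

for n_requests in (100, 1_000, 10_000):
    with ThreadPoolExecutor(max_workers=min(n_requests, 256)) as pool:
        times = sorted(pool.map(timed_request, range(n_requests)))
    print(f"{n_requests} requests: p50={times[len(times) // 2]:.3f}s "
          f"p99={times[int(len(times) * 0.99)]:.3f}s")
```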

I have a few examples where Reiz was used in the development of CPython

It is not persuasive unless there are numbers provided. (Update: sorry that I misread the sentence initially.)

The "at scale" term originated from the fact that other tools which tried to do the same thing (searching source code directly on the AST, though not in the manner of a 'search engine') performed worse than Reiz when run on a giant dataset (50k files, for example).

That is "efficiently" rather than "at scale". Here is a simple question you might think about: assuming Reiz could index C/C++/Java files, could it run efficiently on a Linux kernel dataset? Chromium? AOSP? Again, without any performance numbers, a potential user would only try Reiz on small personal projects. "At scale" is meaningless in that case.

isidentical commented 3 years ago

Fair points! I hadn't considered those aspects, and my comparison was against a limited set of tools. I'll try to update the docs with benchmarks on data I can easily collect through the already-integrated sampling process, and also drop the 'at scale' tag. Thanks!

lutzhamel commented 3 years ago

Hello:

I have some comments on the current state of the manuscript/software:

  1. The documentation does not explain how to set up the software package for a user-specific project, GitHub or otherwise. It only demonstrates the functionality on a prepackaged set of projects.
  2. The query language feels rough and the syntax is unwieldy, creating a huge barrier to entry for the package. Basically, the user is asked to construct AST snippets, in the precise format that CPython specifies, that act as templates against the DB. I wonder whether query languages for graph databases could provide some inspiration for an easier query syntax.
  3. Finally, why not incorporate regular expressions into the match_string_pattern?

@mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.

isidentical commented 3 years ago

Hello @lutzhamel!

The documentation does not explain how to set up the software package for a user-specific project, GitHub or otherwise. It only demonstrates the functionality on a prepackaged set of projects.

Yes, that was intentional. Even though this is trivial and entirely possible (any package hosted on a remote git repo can be used, via https://github.com/reizio/reiz.io/blob/cff3cc6eaad532ac1a956c1f7c7a58d97ea00e4b/reiz/sampling/data.py#L8-L19, which corresponds to the dataset.json file entries), I avoided it since I wanted to keep Reiz a general-purpose search engine. Reiz also employs a pluggable-parts design, so the implementors (in this case reiz.io) provide the sampling strategies. The default strategy fetches the most-downloaded packages from Python's package index, PyPI, though there have been others (not turned on by default), such as fetching the most popular projects from GitHub by number of stars or most recent update.
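
For illustration, pointing the sampler at a user-specific repository would then amount to adding an entry along these lines (the field names here are guesses; the authoritative schema is the data.py definition linked above):

```python
import json

# Hypothetical dataset.json entry for a user-specific project; the real
# field names live in reiz/sampling/data.py and may differ.
entry = {
    "name": "my-project",
    "git_source": "https://github.com/example/my-project",
    "git_revision": "main",
}

with open("dataset.json") as f:
    dataset = json.load(f)  # assumed to be a JSON list of entries
dataset.append(entry)
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```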

The query language feels rough and the syntax is unwieldy, creating a huge barrier to entry for the package. Basically, the user is asked to construct AST snippets, in the precise format that CPython specifies, that act as templates against the DB. I wonder whether query languages for graph databases could provide some inspiration for an easier query syntax.

I have to disagree with you on this point. The language is entirely independent of CPython's (or any other implementation's) AST, as described in the Grammar appendix. When you implement a variant for Reiz, you feed in an ASDL file, which then generates those 'matchers'. In the reference implementation I proposed, the ASDL file originates from CPython (see this). If you were to target another language, this would be totally different.

However, I agree that ReizQL as-is is a low-level language for querying source code, and that is also intentional. I even wrote a high-level interface for Python called irun. The following query is written in a Python superset, where you search for code that looks exactly the same apart from some expandable fragments; it then compiles down to ReizQL.

Query:

```
with open(...) as $stream:
    tree = ast.parse($stream.read())
```

ReizQL version:

```
With(
    items=[
        withitem(
            context_expr=Call(func=Name(id="open"), args=[...], keywords=[]),
            optional_vars=~stream,
        )
    ],
    body=[
        Assign(
            targets=[Name(id="tree")],
            value=Call(
                func=Attribute(value=Name(id="ast"), attr="parse"),
                args=[
                    Call(
                        func=Attribute(value=~stream, attr="read"), args=[], keywords=[]
                    )
                ],
                keywords=[],
            ),
        )
    ],
)
```

As you said, writing the second form by hand would impose a really high barrier to entry. But as I tried to explain, Reiz is just the execution engine; irun is merely syntactic sugar on top of it.

Finally, why not incorporate regular expressions into the match_string_pattern?

When you want to match names (e.g., finding all tests written in pytest/unittest fashion, FunctionDef(f'test_%')), you usually search for prefixes or suffixes. Of course, I don't deny the need for something more sophisticated, but in terms of execution cost that would be very expensive compared to a stricter yet still powerful alternative (the LIKE syntax in SQL). Since the underlying engine has native support for regexes, it would be fairly easy to implement this one day, when the need arises, though the implementation might need to validate or bound the complexity of input regexes; otherwise it could perform really badly, since we don't have trigram indexes, unlike other engines that support them.
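
As a sketch of the semantics described above, treating '%' as the only wildcard the way the f'test_%' example uses it (this is not Reiz's actual matcher, just the LIKE-style behaviour it refers to):

```python
import re

def like_to_predicate(pattern: str):
    # Compile a SQL-LIKE-style pattern where '%' matches any run of
    # characters, so 'test_%' behaves like startswith('test_').
    regex = re.compile("^" + ".*".join(map(re.escape, pattern.split("%"))) + "$")
    return lambda name: regex.match(name) is not None

is_test = like_to_predicate("test_%")
print(is_test("test_login"), is_test("login_helper"))  # True False
```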

@mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.

Even though that saddens me, I do understand the source of the concerns. Thank you both for your reviews and suggestions.

yuhc commented 3 years ago

Hi @isidentical, I think @lutzhamel made a really good point. When you advertise your software, you need to document its features, and probably also its limitations, well. Any undocumented feature is assumed not to exist, while anything implied by the context but not listed among the limitations is assumed to be supported.

From the article proof and repo, we expect to see instructions and examples for an efficient (previously, scalable), user-friendly code search on a user-specific project. Your document should demonstrate these points or admit the shortcomings. I've talked about efficiency, for which no data is provided. @lutzhamel mentioned the user-friendliness and generality of the software, and I'd like to put a question on your plate more explicitly:

I'd also like to mention that AST-based code search and programming, or more generally code-to-code search and programming, have been researched for decades. There are plenty of papers that you can find and read to improve Reiz and its documentation.

mjsottile commented 3 years ago

Hi all - I’ve been a bit buried in day job activities so I’ve been watching the discussion from the sidelines. Thank you for the detailed discussion.

Regarding the review itself:

Comments to @isidentical [note: I'm sharing these comments for your benefit - they may end up ultimately being useful if you choose to write a full paper for a regular research paper venue]:

With respect to some technical details:

Any comments/thoughts/questions?

yuhc commented 3 years ago

I'm fine with accepting Reiz with some minor changes, mainly in its article proof and documentation.

As @mjsottile mentioned, the comparison between string search and tree search, and the use of DSLs for code query, have been well researched. @isidentical, I recommend investigating those works more deeply and coming up with a nicer README for Reiz. For example, you compared Reiz with GitHub code search, which is reasonable at a glance. But if you look closer at GitHub code search, you'll find it's based on Elasticsearch, in which one could reimplement a scalable, efficient, extensible "ElasticReiz.io" with trivial effort.

Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore, and alternatives to Reiz appeared many years ago. However, I'm interested in seeing Reiz grow into a more mature project like https://github.com/jonatas/fast and https://github.com/takluyver/astsearch. It can be a useful tool/library for Python once it proves its efficiency compared with existing libraries. That's why I talked about efficiency so much, and probably why @lutzhamel talked about its programmability. The opportunity I see in Reiz is to make it more user-friendly and as fast as possible.

lutzhamel commented 3 years ago

To add to @yuhc's comments: I am not opposed to publishing this paper as long as the following is fulfilled:

  1. It is made clear that this is the evaluation of an engine, which means all the metrics etc. that come with evaluating an engine should be included. See @yuhc's comments on this.
  2. It is made clear that the query language is not an end-user product but a prototype AST-based query language used to exercise the engine.

Neither of these points is made clear in the current paper, which was the source of my confusion.

mjsottile commented 3 years ago

At this point I believe I have what I need from the reviewers @yuhc and @lutzhamel. I need to think a bit, since over the span of a few days the recommendation swung from not acceptable to acceptable. Two comments, though, that I believe must be addressed:

Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore

This is not supported by activity in the literature. I would refrain from dismissing an area of research as part of a review - it is often unproductive and rarely holds up under scrutiny. A cursory search for recent publications in that relatively broad area yields a steady number of papers up to and including this year. AST-based code search is still active in areas including but not limited to: mining code repositories, clone detection, defect detection, and is a component of some software synthesis tools. AST-based code manipulation is certainly active - almost all code transformation methods are based on either direct manipulation of the AST or operations on AST-derived structures.

Similarly:

But if you look closer at GitHub code search, you'll find it's based on Elasticsearch, in which one could reimplement a scalable, efficient, extensible "ElasticReiz.io" with trivial effort.

It is also generally not good form to dismiss work in this way. It is one thing to refer to another tool as potentially having high overlap to encourage an author to discuss what is different or to state that it is basically the same, but implemented using different methods. Casually dismissing it as trivial is not very productive.

Thanks again for the reviews and I'll figure out next steps once I've heard thoughts from @isidentical regarding how they would like to proceed.

yuhc commented 3 years ago

Thank you @mjsottile for the comments. Yes, I brought those up to suggest directions for improvement, but I didn't mean to dismiss the work or the field. I totally agree with what you said, and I think we're talking about different scopes of "academic value".

AST-based code search was actively explored in academia many years ago, and I saw quite a few amazing papers while I was waiting for Reiz's bug fix. The field isn't as active now, not because it's unimportant (the techniques are used in many places) but because it has been well explored. More nice works and findings are of course likely to appear, but one should think carefully before diving in (in terms of breaking new academic ground).

Reiz is an impressive project, and I believe it can be useful in many cases. (I personally really admire @isidentical for writing this software alone at his age.) The Elasticsearch example was meant to convey that it's important to look around and learn from existing work to avoid overstating any features or novelty. Sorry for the confusion.

lutzhamel commented 3 years ago

@mjsottile, yes, my review went from unacceptable to a possible accept once I understood what I was looking at. In my view the paper/software would not have held up as an "end user" product, but as an engine it is acceptable, provided that the paper addresses the points I made in an earlier comment.

isidentical commented 3 years ago

Thanks @yuhc @mjsottile @lutzhamel for your comments! I'm not sure whether you still want to proceed, but from the last comments I was able to collect a list of things to change in the documentation/paper itself, and I went ahead with them (https://github.com/reizio/reiz.io/commit/64ecc9b754d3698a328d984fc0f08bf5db21a5f9).

(I don't know how to change the title of this issue, but I dropped the "at Scale" part from the paper itself.)

isidentical commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

lutzhamel commented 3 years ago

OK, that looks pretty good to me. One major sticking point remains, and it is the point that originally confused me about what I was looking at: the section "Sampling Source Code". It is written as if the engine can only be used in this context. It should be rewritten to make explicit that this is the experiment used to validate the engine; the user is free to index any source code project they want.

With this rewrite I'd be happy to suggest the paper for publication.

-L

isidentical commented 3 years ago

@whedon generate pdf

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

lutzhamel commented 3 years ago

Looks good to me. From my perspective I am happy to recommend the paper for publication at this point.

yuhc commented 3 years ago

The only missing piece is that the article doesn't mention existing AST-based search engines or libraries in "State of the field". I'm happy to move forward once the relevant citations are added.

isidentical commented 3 years ago

@whedon generate pdf