Closed: whedon closed this issue 3 years ago
Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @lutzhamel, @yuhc it looks like you're currently assigned to review this paper :tada:.
:warning: JOSS reduced service mode :warning:
Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.
:star: Important :star:
If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably watching this repository, which means that, due to GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿
To fix this do the following two things:
For a list of things I can do to help you, just type:
@whedon commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@whedon generate pdf
Software report (experimental):
github.com/AlDanial/cloc v 1.88 T=0.14 s (755.4 files/s, 55795.4 lines/s)
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 80 1346 173 4098
Markdown 8 215 0 919
JSON 4 0 0 901
YAML 7 13 2 130
TeX 1 9 0 81
Bourne Shell 4 14 13 43
DOS Batch 1 8 1 26
Dockerfile 2 8 0 17
TOML 1 2 0 12
make 1 4 7 9
-------------------------------------------------------------------------------
SUM: 109 1619 196 6236
-------------------------------------------------------------------------------
Statistical information for the repository '768d7180239c666227fd7a05' was
gathered on 2021/05/20.
The following historical commit information, by author, was found:
Author Commits Insertions Deletions % of changes
Batuhan Taskaya 177 11703 6086 100.00
Below are the number of rows from each author that have survived and are still
intact in the current revision:
Author Rows Stability Age % in comments
Batuhan Taskaya 5617 48.0 3.6 2.99
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1016/j.scico.2012.04.008 is OK
- 10.5281/zenodo.4657163 is OK
- 10.1007/s10664-017-9514-4 is OK
MISSING DOIs
- None
INVALID DOIs
- None
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Hi @mjsottile, I've tried to run the software within the latest docker on Ubuntu 20.04, but it requires a subtle change and doesn't return search results as expected. Shall I contact the author to resolve this issue? (The document doesn't state enough about the environment settings.)
Please let me know on the bug tracker about any problems you experience building / running Reiz, @yuhc. Thanks!
Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change docker-compose's version from 3.9 to 3.8 to build and start the container.
The build instruction is usually expected to contain the system environment.
I didn't see `API is running on ...` on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search `Call(Name("len"))`. However, the loading icon never disappeared, and no results were ever returned.
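For readers unfamiliar with the query syntax, `Call(Name("len"))` describes call expressions whose callee is the bare name `len`. A rough plain-Python approximation using the stdlib `ast` module (this is just an illustration of what the query matches, not Reiz's engine) would be:

```python
import ast


def find_len_calls(source: str) -> list[int]:
    """Return line numbers of calls whose callee is the bare name `len`.

    A plain-Python approximation of the ReizQL query Call(Name("len")),
    using only the stdlib ast module -- not Reiz's actual engine.
    """
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "len"
    ]
```

Running this over a file should surface the same call sites the search UI is expected to return.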
Could you explain more about what the dataset is built on top of ("~75 files from 10 different projects") and help check what's going wrong here? Also, it'll be great to know the use cases of reiz.io.
> Hi @isidentical, I'm running Ubuntu 20.04, Docker 19.03.13, docker-compose 1.26.2. I had to change docker-compose's version from 3.9 to 3.8 to build and start the container.
Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).
> I didn't see `API is running on ...` on the server end, but I could access the search engine on port 8080 (remotely, as I don't have a desktop environment) and tried to search `Call(Name("len"))`.
You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (It would be better if you could create an issue on the repo itself!)
Thanks!
> Interesting, I can successfully spin up the instances with 3.9, though I will definitely investigate (would you mind opening an issue on the tracker?).
Created https://github.com/reizio/reiz.io/issues/51.
> You should wait for the API to start before accessing it; without it, the web UI will just wait and eventually time out. Would you mind sending me the logs? (It would be better if you could create an issue on the repo itself!)
Hi @isidentical, perhaps I missed it, but I don't see any instructions in the repo on how to run this software. Could you point me to the spot where it tells me how to install and run the software?
Thanks.
@isidentical, never mind just found it under the docs link....
@isidentical, docker-compose does not run on the given files...
ubuntu@ip-172-31-94-52:~$ git clone https://github.com/reizio/reiz.io
Cloning into 'reiz.io'...
remote: Enumerating objects: 1884, done.
remote: Counting objects: 100% (362/362), done.
remote: Compressing objects: 100% (225/225), done.
remote: Total 1884 (delta 206), reused 264 (delta 132), pack-reused 1522
Receiving objects: 100% (1884/1884), 586.58 KiB | 24.44 MiB/s, done.
Resolving deltas: 100% (1098/1098), done.
ubuntu@ip-172-31-94-52:~$ ls
reiz.io
ubuntu@ip-172-31-94-52:~$ cd reiz.io/
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose up --build --remove-orphans
ERROR: Version in "./docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
ubuntu@ip-172-31-94-52:~/reiz.io$
What docker version are you using, @lutzhamel? Please ensure you are using a newer one, something like `19.03.13`.
@isidentical, here is what I am using:
ubuntu@ip-172-31-94-52:~/reiz.io$ docker --version
Docker version 20.10.6, build 370c289
ubuntu@ip-172-31-94-52:~/reiz.io$ docker-compose --version
docker-compose version 1.25.0, build unknown
ubuntu@ip-172-31-94-52:~/reiz.io$
Hi @isidentical, I just uploaded a screenshot of the 8000 page to https://github.com/reizio/reiz.io/issues/52; I hope it is helpful for your debugging. Separately, I don't think the review is going in the right direction, and I should state that clearly in case you aren't aware:
Thanks for your comment @yuhc! It is a bit of a black-box situation for me too, since everything is reproducible in my environment. I'll try to get everything set up on an AWS machine this weekend and let you both know the exact environment. Thanks for your patience.
@isidentical, have you considered setting up a working instance on a web based virtual machine like https://replit.com/ eliminating set up issues all together?
@yuhc, thanks for your comments. I agree, as reviewers we should not be wrangling software, we should be just verifying that it works as advertised. I too will be waiting for a working instance before continuing the review.
> @isidentical, have you considered setting up a working instance on a web based virtual machine like https://replit.com/ eliminating set up issues all together?
That is a great idea! I'll create an open instance as well as instructions on a clean AWS machine. Sorry for all the inconvenience I caused!
@yuhc @lutzhamel I've deployed a public instance at https://reiz.io. Unfortunately, I wasn't able to give exact instructions for an AWS instance since they still haven't approved my personal account. Though, as you stated, if either of you wants access to the environment the server is running in, I can provide it (it is running on a DigitalOcean VPS right now). Please let me know whether this method works for you, thanks!
:wave: @yuhc, please update us on how your review is going (this is an automated reminder).
:wave: @lutzhamel, please update us on how your review is going (this is an automated reminder).
Thanks for setting up reiz.io. It works for me. Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.
BTW, registering an AWS account should just take a few minutes. I guess something went wrong, and you may need to contact the support.
> Could you update the docs to include your exact setup steps on DigitalOcean? I don't think it really matters which cloud service you choose; the main reason we asked you to install reiz.io on a clean VM is that we want reproducible instructions.
Sure! I think `docker-compose` should just work in any case; the errors should be gone by now. As far as I can tell, it was a browser issue, compounded by the confusion that the API and the website are served on different ports, and I've also made a couple of changes to support port forwarding over SSH. I'll double-check it, though. As far as reproducibility goes, I'll also include a document for installing it in deploy mode (Docker creates a toy environment with limited data, whereas the deploy installation indexes ~500 projects instead of 10).
My AWS account has also now been approved, so here are the exact steps to use reiz with only `docker`/`docker-compose` and SSH: https://reizio.readthedocs.io/en/latest/installation-to-aws.html
Hi @isidentical, I took a look at the updated README, and it looks good in general. There are some nits:
- The `git fetch` and `git reset --hard` lines can be removed.
- Use `instance_ip:8000` instead of `localhost:8000`. Or you can remove anything related to AWS and change the instructions to ones for a local Ubuntu 18.04 machine.

Hi @mjsottile, the article proof link is broken. Could you help fix it?
While waiting for the article proof, I have a question for you, @isidentical: the software is about code search at scale, but the database setup seems to take a long time, and there's no experiment showing the query latency for scaled datasets. Is there any evidence for the claim that the search system works "at scale"? (Something like algorithms or a complexity analysis would be fine, just for reference.)
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
> but the database setup seems to take a long time
Yes, indeed. Reiz goes through multiple phases while aggregating data into the database, from downloading the projects to generating a one-time node tag (a fingerprint of each node). Since all of this has to be done in the setup/indexing part, neither users nor the running system pays that cost after installation (during the actual run).
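To illustrate what a one-time node tag buys (this sketch is hypothetical, not Reiz's actual algorithm), a structural fingerprint can be derived by hashing a position-free dump of the node, so that structurally identical fragments share a tag regardless of where they appear:

```python
import ast
import hashlib


def node_tag(node: ast.AST) -> str:
    """Hypothetical sketch of a structural fingerprint for an AST node.

    ast.dump() omits line/column attributes by default, so the hash
    covers only node types and field values; structurally identical
    code fragments therefore get identical tags. Illustrative only,
    not Reiz's actual node-tag implementation.
    """
    dump = ast.dump(node, annotate_fields=True)
    return hashlib.sha1(dump.encode()).hexdigest()[:12]
```

Because the tag depends only on structure, it can be computed once at indexing time and reused at query time without re-walking the source.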
> there's no experiment showing the query latency for scaled datasets.
I'm not sure where to put it (the paper or the software docs), but yes, I can definitely index data at a real-world level (50k+ unique files from ~1,000 projects) and let you know about the latency (which would probably be measured directly against the server API, so it isn't affected by other factors). I have a few examples where Reiz is used in the development of CPython (to collect data, e.g. whether to deprecate a usage, or checking how many arguments have annotations split across lines for a new feature), and I intend to use those queries in the benchmarks (let me know if you have any extra queries you want me to add to that list). It would be a bit hard to compare it with other providers (just as we did in the evaluation), though it might help validate the "at scale" point.
> (Something like algorithms or a complexity analysis would be fine, just for reference.)
The "at scale" term originated from the fact that other tools that tried to do the same thing (search source code directly on the AST, but not in the manner of a "search engine") performed worse than Reiz when run on a large dataset (50k files, for example).
There is a small internal-use utility under Reiz called `./scripts/debug_query.py`, which prints out the AST and IR for a given query (it requires a full-mode installation, with the configuration set). It is helpful when you want to see the complexity of a single query. Here is a simple example, and here is the generated IR of the most complex query that I have. The complexity increases with the number of "checks" you are performing. I don't think it would make sense to express this in big-O terms, though the implementation tries to stay as linear as possible. It used to create a new block every time you used one of the advanced matchers, but now they are all flattened out thanks to https://github.com/reizio/reiz.io/pull/12.
Seems there are some misunderstandings of "at scale". Reiz is a good tool, and I just wanted to know whether its title is a bit overstated.
> This spans from the actual downloading of the projects to generate a one-time node-tag
That's the concern. How long does it take to generate a dataset for 1k projects? 100k? 1M? 10M? For 50k files? 500k? 5M? 50M? 50k files and 1k projects are small scale, to be honest. In most cases, generation/updates also have to be done online for a search engine, but that is out of scope.
> I'm not sure where to put it (the paper or the software docs)
Most framework- or system-level software has performance analysis reports in its documentation (for example, a "Performance" page in https://reizio.readthedocs.io/en/latest/reizql.html). I don't doubt its importance, but I won't force you to measure performance at this moment, as reiz is not yet a production-ready system. I am pretty sure you will need it in the future.
> but yes, I can definitely index data at a real-world level (50k+ unique files from ~1,000 projects) and let you know about the latency (which would probably be measured directly against the server API, so it isn't affected by other factors).
As I said, this data scale is not really impressive. And please remember, reiz will not receive only a single request at a time: thousands or tens of thousands of requests could be sent simultaneously. I will not call a system "at scale" if it cannot prove its ability to handle datasets and request volumes of a significant size.
Why I asked for latency: if there were measurements for 50k, 500k, and 5M files with 100, 1k, and 10k requests, and they showed similar latencies, reiz would then have proven that it is scalable.
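For concreteness, the kind of measurement being asked for could be scripted with a small harness like the one below, where `run_query` stands in for an actual request against a deployed reiz instance (the names here are illustrative, not part of reiz):

```python
import statistics
import time


def measure_latency(run_query, repeats: int = 5) -> dict:
    """Time a query callable over several runs and summarize the results.

    `run_query` is a placeholder: in a real benchmark it would issue an
    HTTP request against the reiz API for one of the test queries, at
    each dataset size / concurrency level being evaluated.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Repeating this across the 50k / 500k / 5M-file datasets and comparing the medians is the scalability evidence being requested.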
> I have a few examples where Reiz is used in the development of CPython
It is not persuasive unless there are numbers provided. (Update: sorry that I misread the sentence initially.)
> The "at scale" term originated from the fact that other tools that tried to do the same thing (search source code directly on the AST, but not in the manner of a "search engine") performed worse than Reiz when run on a large dataset (50k files, for example).
That is "efficiently" instead of "at scale". There is a simple question that you may think about: assume reiz can support indexing C/C++/Java files, could it run on a Linux kernel dataset efficiently? Chromium? AOSP? Again, without any perf numbers, a potential user of reiz would only try reiz with small-sized personal projects. "At scale" is meaningless in such a case.
Fair points! I hadn't been thinking at those scales, and my comparison was with a limited set of tools. I'll try to update the docs with benchmarks on the data that I can easily collect through the already-integrated sampling process, and also drop the "at scale" tag at the end. Thanks!
Hello:
I have some comments on the current state of the manuscript/software:
- Finally, why not incorporate regular expressions into `match_string_pattern`?

@mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.
Hello @lutzhamel!
> The documentation does not explain how to set up the software package for a user specific project, GitHub or otherwise. It only demonstrates the functionality on some prepackaged set of projects.
Yes, that was intentional. Even though it is very trivial to do, and possible (any package hosted on some sort of remote git repo can be used, through https://github.com/reizio/reiz.io/blob/cff3cc6eaad532ac1a956c1f7c7a58d97ea00e4b/reiz/sampling/data.py#L8-L19, which corresponds to the `dataset.json` file entries), I'd avoided it since I wanted to keep Reiz a general-purpose search engine. Reiz also employs a pluggable-parts design, so the implementors (in this case reiz.io) provide the sampling strategies. The default strategy fetches the most-downloaded packages from Python's package index, PyPI, though there have been others (even if not turned on by default), such as fetching the most popular projects from GitHub by either the number of stars or the last update.
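The pluggable-parts idea can be pictured with a sketch like the one below. The names `Project`, `SamplingStrategy`, and `StaticList` are illustrative stand-ins, not Reiz's actual API; the real entry format lives in `reiz/sampling/data.py`:

```python
from dataclasses import dataclass
from typing import Iterator, List, Protocol


@dataclass
class Project:
    # Illustrative stand-in for a dataset.json entry: a name plus the
    # remote git URL the source will be fetched from.
    name: str
    git_url: str


class SamplingStrategy(Protocol):
    # Implementors (e.g. reiz.io) plug in their own source of projects:
    # top PyPI downloads, GitHub stars, a hand-written list, etc.
    def sample(self, limit: int) -> Iterator[Project]: ...


class StaticList:
    """Trivial strategy: yield projects from a fixed list."""

    def __init__(self, projects: List[Project]) -> None:
        self.projects = projects

    def sample(self, limit: int) -> Iterator[Project]:
        yield from self.projects[:limit]
```

Pointing the engine at a user-specific repository then amounts to supplying a strategy (or list entry) for it, rather than changing the engine itself.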
> The query language feels rough and the syntax is unwieldy, creating a huge barrier of entry to the package. Basically the user is asked to construct AST snippets in the precise format that CPython specifies that act as templates against the DB. I am wondering if query languages for graph databases can give some inspiration for an easier query syntax.
I'd have to disagree with you on this point. The language is entirely independent of CPython's (or any other implementation's) AST, as described in the Grammar appendix. When you implement a variant for Reiz, you feed it an ASDL file, which then creates those "matchers". In the reference implementation I proposed, the ASDL file originated from CPython (see this). If you were to implement another language, it would be totally different.
However, I agree that ReizQL is a low-level language for querying source code, and that is also intentional. I even wrote a high-level interface for Python called irun. The following query is written in a Python superset, where you search for code that looks exactly the same with some expanded fragments; it is then compiled to ReizQL.
Query:

```
with open(...) as $stream:
    tree = ast.parse($stream.read())
```
ReizQL version:

```
With(
    items=[
        withitem(
            context_expr=Call(func=Name(id="open"), args=[...], keywords=[]),
            optional_vars=~stream,
        )
    ],
    body=[
        Assign(
            targets=[Name(id="tree")],
            value=Call(
                func=Attribute(value=Name(id="ast"), attr="parse"),
                args=[
                    Call(
                        func=Attribute(value=~stream, attr="read"), args=[], keywords=[]
                    )
                ],
                keywords=[],
            ),
        )
    ],
)
```
As you said, writing the second form by hand would create a really high barrier to entry. Though, as I tried to explain, Reiz is just the execution engine; the superset is merely syntactic sugar on top of it.
> Finally, why not incorporate regular expressions into the match_string_pattern?
When you want to match names (e.g. finding all tests written in pytest/unittest fashion, `FunctionDef(f'test_%')`), you often search for prefixes or suffixes. Of course, I don't deny the occasional need for something more complex, but in terms of execution cost that would be very expensive compared to a stricter but still powerful alternative (the LIKE syntax in SQL). Since the underlying engine has native support for regexes, it would be fairly easy to implement this one day when the need arises, though some implementations might need to validate / check the complexity of input regexes; otherwise it might perform really badly, since we don't have trigram indexes, unlike other engines that support this.
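To make the trade-off concrete, here is a hedged sketch (not Reiz's implementation) of how a simplified LIKE-style pattern such as `test_%` can be compiled into an anchored regex. Only `%` is treated as a wildcard here; `_` is matched literally, matching how it is used in the `FunctionDef(f'test_%')` example (in real SQL, `_` matches a single character):

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern":
    """Compile a simplified LIKE pattern into an anchored regex.

    Only '%' is treated as a wildcard (matching any run of characters);
    everything else, including '_', is matched literally. Illustrative
    sketch only -- not Reiz's actual matcher.
    """
    parts = (re.escape(chunk) for chunk in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$")


def like_match(pattern: str, text: str) -> bool:
    return like_to_regex(pattern).match(text) is not None
```

Because the pattern is anchored and its structure is known in advance, this class of match stays cheap, which is the cost argument made above against arbitrary user-supplied regexes.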
> @mjsottile, given @yuhc's comments and my comments above, I feel that the software is not mature enough at this point to warrant publication. I feel that the software needs to be deployed in some real-life settings and demonstrate benefits before the paper can be accepted.
Even though that saddens me, I do understand the source of your worries. Thank you both for your reviews and suggestions.
Hi @isidentical, I think @lutzhamel made a really good point. When you advertise your software, you need to document its features and probably also limitations well. Any undocumented feature is assumed not to exist, while anything implied from the context but undocumented in the limitations is supposed to be supported.
From the article proof and repo, we expect to see the instructions and examples for an efficient (scalable, previously), user-friendly code search on a user-specific project. Your document should demonstrate these points or admit some shortcomings. I've talked about the efficiency for which no data is provided. @lutzhamel mentioned user-friendliness and generality of the software, and I'd like to put a question onto your plate more explicitly:
I'd also like to mention that AST-based code search and programming, or more generally code-to-code code search and programming, have been researched for decades. There are plenty of papers that you can find and read to improve Reiz and its documentation.
Hi all - I’ve been a bit buried in day job activities so I’ve been watching the discussion from the sidelines. Thank you for the detailed discussion.
Regarding the review itself:
Comments to @isidentical [note: I'm sharing these comments for your benefit - they may end up ultimately being useful if you choose to write a full paper for a regular research paper venue]:
With respect to some technical details:
Any comments/thoughts/questions?
It's fine with me to accept Reiz with some minor changes, mainly in its article proof and documentation:
As @mjsottile mentioned, the comparison between string search and tree search, and the use of DSLs for code query, have been well researched. @isidentical, I recommend you investigate those works more deeply and come up with a nicer README for Reiz. For example, you compared Reiz with GitHub code search, which is reasonable at a glance. But if you look closer at GitHub code search, you'd find it's based on ElasticSearch, on which one could reimplement a scalable, efficient, extensive "ElasticReiz.io" with trivial effort.
Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore, and alternatives to Reiz appeared many years ago. However, I'm interested in seeing Reiz grow into a more mature project like https://github.com/jonatas/fast and https://github.com/takluyver/astsearch. It can be a useful tool/library for Python once it proves its efficiency compared with existing libraries. That's why I talked about efficiency so much, and probably why @lutzhamel talked about its programmability. The opportunity I see in Reiz is to make it more user-friendly and as fast as possible.
To add to @yuhc's comments: I am not opposed to publishing this paper as long as the following is fulfilled:
Both of these points are not made clear in the current paper and therefore were the source of my confusion.
At this point I believe I have what I need to know from the reviewers @yuhc and @lutzhamel. I need to think a bit since over the span of a few days the recommendation swung from not acceptable to acceptable. Two comments though that I believe must be addressed:
> Honestly, the academic value of Reiz is low: AST-based code manipulation is not a popular topic anymore
This is not supported by activity in the literature. I would refrain from dismissing an area of research as part of a review - it is often unproductive and rarely holds up under scrutiny. A cursory search for recent publications in that relatively broad area yields a steady number of papers up to and including this year. AST-based code search is still active in areas including but not limited to: mining code repositories, clone detection, defect detection, and is a component of some software synthesis tools. AST-based code manipulation is certainly active - almost all code transformation methods are based on either direct manipulation of the AST or operations on AST-derived structures.
Similarly:
> But if you look closer at GitHub code search, you'd find it's based on ElasticSearch, on which one could reimplement a scalable, efficient, extensive "ElasticReiz.io" with trivial effort.
It is also generally not good form to dismiss work in this way. It is one thing to refer to another tool as potentially having high overlap to encourage an author to discuss what is different or to state that it is basically the same, but implemented using different methods. Casually dismissing it as trivial is not very productive.
Thanks again for the reviews and I'll figure out next steps once I've heard thoughts from @isidentical regarding how they would like to proceed.
Thank you @mjsottile for the comments. Yes, I brought those up to give some suggestions for directions of improvement, but I didn't mean to dismiss the work or the field. I totally agree with what you said, and I think we're talking about different scopes of "academic value".
AST-based code search was actively explored in academia many years ago, and I saw quite a few amazing papers while I was waiting for Reiz's bug fix. The field isn't that active now, not because it's unimportant (the techniques have been used in many places) but because it has been well explored. More nice work and findings are of course likely to appear, but one should think about it carefully before diving in to produce new "academic value".
Reiz is an impressive project, and I believe it can be useful in some cases. (I personally really admire @isidentical for writing this software alone at his age.) The ElasticSearch example was meant to convey that it's important to look around and learn from existing works to avoid overstating any features or novelty. Sorry for the confusion.
@mjsottile, yes, my review went from unacceptable to a possible accept once I understood what I was looking at. In my view, the paper/software would not have held up as an "end user" product, but as an engine it is acceptable, provided that the paper addresses the points I made in an earlier comment.
Thanks @yuhc @mjsottile @lutzhamel for your comments! I'm not sure whether you still want to proceed, but looking at the last comments I was able to collect a list of things to change in the documentation/paper itself, and went ahead with them (https://github.com/reizio/reiz.io/commit/64ecc9b754d3698a328d984fc0f08bf5db21a5f9).
(I don't know how to change the title of this issue, though I dropped the "at Scale" part from the paper itself.)
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
OK. That looks pretty good to me. One major sticking point, and a point that originally got me confused about what I was looking at, remains: the section "Sampling Source Code". It is written as if the engine can only be used in this context. It should be rewritten to make explicit that this is the experiment used to validate the engine; the user is free to index any source code project they want.
With this rewrite I'd be happy to suggest the paper for publication.
-L
@whedon generate pdf
:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:
Looks good to me. From my perspective I am happy to recommend the paper for publication at this point.
The only missing piece is that the article doesn't mention existing AST-based search engines or libraries in "State of the field". It looks good to me to move forward after the relevant citations are added.
@whedon generate pdf
Submitting author: @isidentical (Batuhan Taskaya)
Repository: https://github.com/reizio/reiz.io
Version: v1.0.0
Editor: @mjsottile
Reviewers: @lutzhamel, @yuhc
Archive: 10.5281/zenodo.5029255
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@lutzhamel & @yuhc, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mjsottile know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Review checklist for @lutzhamel
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper
Review checklist for @yuhc
Conflict of interest
Code of Conduct
General checks
Functionality
Documentation
Software paper