[REVIEW]: EMBERS: Experimental Measurement of BEam Responses with Satellites

whedon commented 4 years ago

Submitting author: @amanchokshi (Aman Chokshi) Repository: https://github.com/amanchokshi/EMBERS Version: v1.0.0 Editor: @mbobra Reviewers: @teuben, @mbobra Archive: 10.5281/zenodo.4287813

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/5d0d8c8aa9cfa96790cf4109d1fc00c7"><img src="https://joss.theoj.org/papers/5d0d8c8aa9cfa96790cf4109d1fc00c7/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/5d0d8c8aa9cfa96790cf4109d1fc00c7/status.svg)](https://joss.theoj.org/papers/5d0d8c8aa9cfa96790cf4109d1fc00c7)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@teuben & @mbobra, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

Make sure you're logged in to your GitHub account
Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mbobra know.

✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨

Review checklist for @teuben

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@amanchokshi) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

Review checklist for @mbobra

Conflict of interest

[x] I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

[x] I confirm that I read and will adhere to the JOSS code of conduct.

General checks

[x] Repository: Is the source code for this software available at the repository url?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
[x] Contribution and authorship: Has the submitting author (@amanchokshi) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
[x] Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines

Functionality

[x] Installation: Does installation proceed as outlined in the documentation?
[x] Functionality: Have the functional claims of the software been confirmed?
[x] Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
[x] Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
[x] Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
[x] Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

[x] Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
[x] A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
[x] State of the field: Do the authors describe how this software compares to other commonly-used packages?
[x] Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
[x] References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

whedon commented 4 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @teuben, @hasantahir it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 4 years ago

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1017/pasa.2018.30 is OK

MISSING DOIs

- None

INVALID DOIs

- None

whedon commented 4 years ago

:point_right: Check article proof :page_facing_up: :point_left:

mbobra commented 4 years ago

👋 @teuben @hasantahir Thank you so much for agreeing to review this submission! You'll see your reviewer checklists above. If you'd like more information on any of these items, please reference the review critera. Also, we encourage open communication between the reviewers, submitting author, and editor. So please feel free to ask questions here on this review thread. I'm happy to help!

mbobra commented 4 years ago

/ooo September 14 until September 18

ooo[bot] commented 4 years ago

:+1: Marked @mbobra as OOO from Monday, September 14th 2020 to Friday, September 18th 2020. :calendar:

mbobra commented 4 years ago

@teuben @hasantahir How is the review going? Do you need any help? Please let me know, I'm happy to help!

mbobra commented 4 years ago

@teuben Please respond and let me know if you are still able to review.

mbobra commented 4 years ago

@whedon remove @hasantahir as reviewer

whedon commented 4 years ago

OK, @hasantahir is no longer a reviewer

mbobra commented 4 years ago

@whedon add @mbobra as reviewer

whedon commented 4 years ago

OK, @mbobra is now a reviewer

mbobra commented 4 years ago

👋 @amanchokshi I am going through the checklist and your package looks really nice. It has been an easy review so far. I have a couple comments:

Functionality

I am trying to follow the Embers by Example guide and am stuck on the second code block. Is test_tool installed with embers? When I try to run the command python -m test_tool, I get the error: No module named test_tool.

Documentation

I am a bit confused about the statement of need in the documentation, since it reads slightly differently than the JOSS paper (I am not an expert in radio astronomy).

From the online documentation:

EMBERS is a python package designed to enable polarised measurements of radio telescope antenna beam-patterns using satellites.

From the JOSS paper:

EMBERS is a python package which provides a modular framework for radio telescopes and interferometric arrays such as the MWA, HERA, and the upcoming SKA to accurately measure the all sky beam responses of their antennas using weather and communication satellites.

Could you perhaps clarify and expand the statement of need in the documentation? Also, the documentation doesn't mention the target audience for this software (e.g. astronomers, systems engineers?). Could you please include this?

Software Paper

The statement of need includes this sentence: "Chokshi et al (in prep) represents the first complete large scale implementation of a satellite based beam measurement system." Could you please write one or two sentences about what has been done in the past, why this is an improvement, and how this software compares to other commonly-used packages?
Could you please write a sentence to the explain the gray (unobserved) areas in the caption of Figure 1?
We prefer citations with a DOI. The paper includes a couple references to Chokshi et al., in prep, which isn't yet peer-reviewed and doesn't yet have a DOI. Would it be possible to replace this with another reference or update it with a DOI?

amanchokshi commented 4 years ago

@mbobra Thanks Monica. I've addressed most of your comments.

Functionality

In Embers by Example, the second and third code blocks are meant for people unfamiliar with python. test_tool is an example package which doesn't exist, but is used to demonstrate that the tools can be run using either python -m test_tool or simply test_tool. The following code block show that help can be accessed for each tool using the --help flag. I've made the preceding text clearer by saying that this is an example tool.

Documentation

I've updated the statement of need in the online docs to match the JOSS paper and have included the target audience.

Paper

Added that all previous experiments were validations of the methodology, and this was the first large scale test. There was no previously available software, suitable for the large data volume. I've also update the caption of Fig. 1 with much more detail on what all the plots mean, including the grey regions.

Unfortunately, Chokshi et al., in prep is under collaboration review right now, and will probably take a couple of months before it's published. The other paper which is relevant is Line et al., which I've cited. It's about one of the previous validation experiments and is not as relevant as it deals with much smaller scale and does not use the EMBERS package.

mbobra commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mbobra commented 4 years ago

@amanchokshi Thanks for your updates, I really appreciate it. A couple more comments:

Examples I'm working through the examples and am still a bit confused.

One example in the Embers by Example docs says:

To get a quick preview of the raw RF data, we create waterfall plots. The following code creates a waterfall plot of sample data provided with EMBERS
$ waterfall_single
--------------------------------------------------
No input data provided, using packaged sample data
>> waterfall_single --help, for more options
--------------------------------------------------
Waterfall plot saved to ./embers_out/rf_tools/S06XX_2019-10-10-02:30.png

But this doesn't work exactly as written for me; I have to do a

from embers.rf_tools import rf_data
rf_data.single_waterfall

But then I'm stuck, because I don't see any example rf_tiles to use with rf_data.single_waterfall().

Would it be possible to create a couple examples that a user can execute from start to finish? You have a really nice example gallery, so by no means am I asking to re-do all of it, but I think users may be confused about how exactly to get started. Some sample data may be a nice way to quickly show users how they can use the package.

References

Unfortunately, Chokshi et al., in prep is under collaboration review right now, and will probably take a couple of months before it's published. The other paper which is relevant is Line et al., which I've cited. It's about one of the previous validation experiments and is not as relevant as it deals with much smaller scale and does not use the EMBERS package.

I totally understand; does the journal allow you to put Chokshi et al. on a preprint service, such as arXiv? Readers need to have access to the references, so Chokshi et al. must be accessible or replaced with a reference to something else. If you presented the work as a poster or talk at a conference before, then you can put your poster or slides on Zenodo and get a DOI for it and reference that instead. Let me know if you want help with that.

amanchokshi commented 4 years ago

@mbobra thanks Monica, it seems like you've caught a bug in my code. I created a new virtual env and installed EMBERS following my examples. I had the same issue. Apparently my sample data is not being installed with pip. I will try and get it to work or modify my instructions.

For now, what works:

git clone https://github.com/amanchokshi/EMBERS.git
cd EMBERS
pip install --editable .

waterfall_single

References

I will talk to my supervisor and see when I can upload a draft of my paper to arXiv. By the end of this review process I may be able to update the paper with an arXiv doi. If now, I can definitely upload a conference talk to zenodo and cite that.

amanchokshi commented 4 years ago

@mbobra I've fixed the bug. If you update embers to the latest version 0.7.3, the examples should now work as shown in the docs

mbobra commented 4 years ago

@amanchokshi I'm sorry -- I'm probably missing something simple. I'm looking at the Embers by Example guide, which says:

$ mkdir ~/EMBERS
$ cd ~/EMBERS

So I did that. Then I skipped the stuff about the test tool. Then the next code block says:

waterfall_single

But I'm still confused 😟 Where are the sample data? How do I access these data? What import statements do I need to make before I can run waterfall_single? Are there any example notebooks I can look at?

I was looking at a recently published JOSS paper called PyCS3 (which I picked at random). They provide a tutorial and example notebooks, which users can download and run. I don't want to be overly prescriptive -- so PYCS3's approach is by no means the only way to provide examples, but I personally find it easy to understand. Perhaps this can be fixed by a few explanatory sentences if I am missing something simple.

References

I will talk to my supervisor and see when I can upload a draft of my paper to arXiv. By the end of this review process I may be able to update the paper with an arXiv doi. If now, I can definitely upload a conference talk to zenodo and cite that.

Yes another possibility (if you want to this JOSS paper and your other your peer-reviewed paper to act as companion papers) is to pause publication of this JOSS paper until the other one is accepted for publication. Then you can cite the DOI for that paper in this one and vice-versa.

amanchokshi commented 4 years ago

@mbobra

waterfall_single is a cli tool which makes it easy to interact with the underlying single_waterfall function.

Internally, what waterfall_single does:

import pkg_resources
from embers.rf_tools.rf_data import single_waterfall

# If no data provided, use package sample data
rf_file = pkg_resources.resource_filename("embers.kindle", "data/rf_data/S06XX/2019-10-10/S06XX_2019-10-10-02:30.txt")
out_dir  = "embers_out/rf_tools"

single_waterfall(rf_file, out_dir)

In Embers by Example I have added a link which takes you to the api docs of the relevant function. If you think this is sufficient, I can all similar links to all the examples.

Alternately, I could create a notebook, with accompanying example data.

mbobra commented 4 years ago

@amanchokshi

Ah, this is awesome! 🚀 Yes I do think having one or two simple notebooks, exactly like the code block above, will really help folks get started.

I checked off the functionality box above; after the notebooks, I'll check off the examples. Then all we have to do is sort out the references and we're done!

amanchokshi commented 4 years ago

@mbobra Thanks Monica! I'll let you know when I've completed the notebooks. It may take a couple of days.

amanchokshi commented 4 years ago

Hi @mbobra, there have been quite significant updates to the Embers by Example section of the documentation.

Can you begin by updating to version 0.7.4 and deleting all the old files you may have created?

The instructions begin with downloading one day of sample data from a new data repository, which all the subsequent examples use. Each section now has instruction on how to use the cli-tool, with an equivalent example code-block below. I've also linked the functions use which will take you to the relevant section of the EMBERS API.

Certain examples will take a little while to run and I expect that you can run all the examples in less than 3 hours. It's important to run them in the sequence presented, because data products created early on are used at later stages.

Hope things work this time and are much more clear 😁

teuben commented 4 years ago

Just a suggestion on this test though: is there no way you could make this test run in a few minutes? If a cosmological simulation takes 3 hours, you can either take fewer steps or a smaller grid, but it should functionally give some answer (that can be compared to something given as baseline result). I'm not too happy that I would need to wait 3 hours on a test.

On 10/26/20 12:20 PM, Aman Chokshi wrote:

Hi @mbobra https://github.com/mbobra, there have been quite significant updates to the Embers by Example https://embers.readthedocs.io/en/latest/embersbyexample.html section of the documentation.

Can you begin by updating to version 0.7.4 and deleting all the old files you may have created?

The instructions begin with downloading one day of sample data from a new data repository, which all the subsequent examples use. Each section now has instruction on how to use the cli-tool, with an equivalent example code-block below. I've also linked the functions use which will take you to the relevant section of the EMBERS API https://embers.readthedocs.io/en/latest/api.html.

Certain examples will take a little while to run and I expect that you can run all the examples in less than 3 hours. It's important to run them in the sequence presented, because data products created early on are used at later stages.

Hope things work this time and are much more clear 😁

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2629#issuecomment-716660063, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZ4MGOEXOJQZCQZGVOCCM3SMWOVXANCNFSM4QSAJ3TA.

mbobra commented 4 years ago

@teuben I agree that it would be nice if the tests run faster. This is not a requirement for passing review, though (unless the software makes performance claims). I advise opening an issue on the repo about the runtime. And I'm happy to co-sign that. But this is not a blocking item.

mbobra commented 4 years ago

@amanchokshi Thank you for all this hard work. I'm happy to recommend Embers for publication pending an accessible reference to Chokshi et al., in prep.

amanchokshi commented 4 years ago

Hi @teuben, the tests for embers are automated, and take ~20 minutes on Travis.

The examples shown in Embers by Example do take longer, but are meant to be a guide to show new users how to begin interacting with Embers. I've reduces the amount of sample data as much as possible (only 1 day of data, instead of 6 months), but some of the later functions are statistical in nature and require multiple satellite passes along with their corresponding data files to accurately calibrate the instrument.

amanchokshi commented 4 years ago

Thanks @mbobra, I've talked to my supervisor regarding the reference to Chokshi et al., in prep, and we've decided to remove the reference from the JOSS paper. The other reference, Line et al. outlines the premise of the experimental setup, and provide sufficient scientific background for this work.

I've removed Chokshi et al., in prep, from the paper and have reworded things a bit. Hope that's okay.

mbobra commented 4 years ago

@whedon generate pdf

whedon commented 4 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

teuben commented 4 years ago

I've been adding some comments in the issue tracker of the repo, author is working on those. I'm about halfway through the examples, a bit more work than I thought, will take a night sleep on it. more tomorrow.

amanchokshi commented 4 years ago

@teuben Thanks a lot Peter!

teuben commented 3 years ago

@amanchokshi As I'm running through the examples, you mentioned there is a quick test, presumably via pytest.ini and the tests/ directory. I don't see the example command line in the README or somewhere how to run this, so "beginners" will be baffled. Can you add that somewhere (and remind me here as well, as I don't remember)

teuben commented 3 years ago

@amanchokshi apart from some minor things I know you are looking into, my last hurdle on the checkmark list is performance.

There are no good comments on performance that I could find (in fact, in the Example there should be some warnings in more places, but you are working on the aggressive CPU usage already). I have not looked at the code, but let me try to explain how I would lke to try and work towards an answer why this is so slow. We have N files, with each file M lines. Yet, the example you mentioned took 40 core hours. Can this be understood in perhaps an O(N) analysis? Are there no loops in python? Is everything done in good numpy notation,without looping? There is of course the potential speedup using XLA/numba/dask, which is mentioned in one of the issues, but before resorting to this, would be nice if we could understand in simple terms why it takes as long as it does.

amanchokshi commented 3 years ago

I've added instructions to run the automated tests at the bottom of the examples page - Testing EMBERS. It should take no more than 2 minutes to run the tests. To run the test you need to clone the directory and install a few additional dependancies:

git clone https://github.com/amanchokshi/EMBERS.git
cd EMBERS

# Setup a virtual enviroment
python -m venv embers-env
source embers-env/bin/activate

pip install .
pip install -r requirements_dev.txt

# Run automated tests
pytest

amanchokshi commented 3 years ago

@teuben I've been playing around with align_batch and the --max_cores option to try and get a handle of the parallel performance. On my laptop, I've tried running all the sample data with between 1 and 4 cpu cores. I used the GNU time tool to gauge the performance.

The system cpu usage vs number of cpu cores is as expected: system_cpu

The time vs number of cpu cores seems to fall off at higher number of cores. This is probably partially due to running things on my laptop which has a limited resources free. time_cpu

Below are the raw terminal outputs which I used to generate the two figure

$ time align_batch --out_dir=tmp --max_cores=1
>>> Aligned files saved to: tmp
>>> align_batch --out_dir=tmp --max_cores=1  879.14s user 47.24s system 104% cpu 14:45.86 total

$ time align_batch --out_dir=tmp --max_cores=2
>>> Aligned files saved to: tmp
>>> align_batch --out_dir=tmp --max_cores=2  999.20s user 72.56s system 212% cpu 8:23.51 total

$ time align_batch --out_dir=tmp --max_cores=3
>>> Aligned files saved to: tmp
>>> align_batch --out_dir=tmp --max_cores=3  1012.36s user 129.45s system 309% cpu 6:08.94 total

$ time align_batch --out_dir=tmp --max_cores=4
>>> Aligned files saved to: tmp
>>> align_batch --out_dir=tmp --max_cores=4  1143.28s user 137.17s system 401% cpu 5:18.70 total

In the code, I have a for loop which selects pairs of data files, which will be temporally aligned, smoothed using a savgol filter and interpolated to a lower frequency. Beyond the initial for loop to select the files, the interpolation and filtering is achieved using scipy acting on numpy arrays.

I don't really know that much about code optimization, but I think that numpy and scipy may already be optimized to grab available cores. If that's the case, it could explain the drop off we see in time vs number of cores.

The above tests used the 1440 data files available in the sample data repository. The complete data set includes 202,000 files with ~17,000 lines of data each, amounting to almost 300 GB of data.

teuben commented 3 years ago

This is a very useful analysis to see how it scales with the processor! Nicely done. Theoretically for ideal parallel behavior the user and system time should not depend on max_cores (NP), and cpu (elapsed) goes as 1/NP. So, although we do see that the % (user+system)/elapsed) scales quite nicely, you can see the user time goes from 879" to 1143", a 30% "loss". You can see the system time went from 47" to 137", a factor of 3, far worse. This could point to I/O contention, more than CPU s clashing. Modern i7 laptop cpus are pretty good in general, but terrible when it comes to dealing with parallel tasks when 100% of the threads are used. You didn't try more than 4, I'm guessing there are 4 cores and each with 2 threads, the most common situation these days, so CPU wise you didn't get to the worse point. But the system times scales much worse. In myexperience running between 4 and 8 actually doesn't improve the elapsed time, in fact, more than often it reverses and goes slower.

There is still the lingering question if we can understand a single core performance of 15 minutes. Each txt file in tiles_data has about 17k lines. I only see one day (2019-10-10) in that tree, and yes, there are thus 1440 files. So in other words, this is like a N=1440 x M=17000 matrix, or 25M points. The O(N) analysis means that I would like to understand how this problem scales with M and N. And then looking inside, there are only ~25M points. What needs to be done that takes 15 minutes., is that something that makes sense? Another way to think is at a 6GFlop machine, it amounts to ~200k floating point operations per point. This is the point I don't understand about EMBERS, and I think it would add value to the paper (or README somewhere) to say something about it.

teuben commented 3 years ago

I've added instructions to run the automated tests at the bottom of the examples page - Testing EMBERS. It should take no more than 2 minutes to run the tests. To run the test you need to clone the directory and install a few additional dependancies:
git clone https://github.com/amanchokshi/EMBERS.git
cd EMBERS

# Setup a virtual enviroment
python -m venv embers-env
source embers-env/bin/activate

pip install .
pip install -r requirements_dev.txt

# Run automated tests
pytest

this ran fine for me. I had two failures (and a few deprecation warnings):

  tests/test_mwa_utils/test_mwa_pointings.py::test_clean_meta_json_start FAILED                                                                                                                                      
  tests/test_tile_maps/test_ref_fee_healpix.py::test_create_model_last FAILED

amanchokshi commented 3 years ago

The depreciation warnings come from a package which I use - mwa-pb, and has required me to freeze some of the dependancies in the requirement.txt. I'm not sure why those two tests are failing. I cloned a fresh copy of EMBERS and tried the tests again and all of them seem to pass. Could it be a python version thing? I'm stumped

Screen Shot 2020-11-18 at 1 13 39 pm

The automated tests on travis are all passing too. Check out Latest Test and All Past Tests

amanchokshi commented 3 years ago

Regarding the performance, I've re-run align_batch using only a single core, with a varying number of files. The probes your question of how the problem scales with N. The results seem to indicate a linear relationship between number of files and processing time.

files_time

I'm quite out of my depth with this kind of code testing, but let me try and clarify exactly what the code does. Each data file has ~17k lines, which each contain 112 floating point numbers corresponding to the 112 frequency channels available. The 17k lines correspond to 17k timesteps at which power was samples. A pair of two such files are selected.

The two arrays are interpolated down to a lower frequency (1Hz default). This reduces the array to (1800, 112)
A two stage savgol filter is applied along the time axis (1800 points). This is a 1D filter along the longer axis of the array.
The two interpolated and smoothed arrays are saved to disk

amanchokshi commented 3 years ago

I've been looking into the performance as a function of the number of lines in the files. I did this by modifying embers.rf_tools.align_data.savgol_interp, to take a fraction and only read the first fraction of lines in the files. The align_batch function is built around the savgol_interp function by paralelizing it and writing the outputs to disk.

I repeated the test for each fraction I used 10 times, and used the median value as an estimate of time. If we increase the number of loops, I expect the plot to stabilise further. lines_time

My inference of the above plot is that almost half of the processing time for each set of files comes from reading in the data, beyond which the time increases linearly with number of lines.

Below is the code which I used to generate this analysis:

import math
from time import time

import numpy as np
from embers.rf_tools.rf_data import read_data
from matplotlib import pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter

def savgol_interp(
    ref,
    tile,
    savgol_window_1=None,
    savgol_window_2=None,
    polyorder=None,
    interp_type=None,
    interp_freq=None,
    fraction=1,
):

    """Interpolate a power array followed by savgol smoothing.
    """

    # Read time and power arrays from data files
    ref_power, ref_time = read_data(ref)
    tile_power, tile_time = read_data(tile)

    ref_length = ref_time.shape[0]
    tile_length = tile_time.shape[0]

    # Number of lines in different files will differ due to varied
    # recording rates of different recievers
    #  print(f"Number of lines in rf0XX file: {ref_length}")
    #  print(f"Number of lines in S06XX file: {tile_length}")

    # Select the first [Fraction] of number of lines
    ref_len_new = int(round(fraction * ref_length))
    tile_len_new = int(round(fraction * tile_length))

    ref_power = ref_power[:ref_len_new, ::]
    ref_time = ref_time[:ref_len_new]

    tile_power = tile_power[:tile_len_new, ::]
    tile_time = tile_time[:tile_len_new]

    # Round up/down to nearest integer of time
    start_time = math.ceil(max(ref_time[0], tile_time[0]))
    stop_time = math.floor(min(ref_time[-1], tile_time[-1]))

    # Array of times at which to evaluate the interpolated data
    time_array = np.arange(start_time, stop_time, (1 / interp_freq))

    # Mathematical interpolation functions
    f = interpolate.interp1d(ref_time, ref_power, axis=0, kind=interp_type)
    g = interpolate.interp1d(tile_time, tile_power, axis=0, kind=interp_type)

    # New power array, evaluated at the desired frequency
    ref_ali = f(time_array)
    tile_ali = g(time_array)

    # Savgol level 1. Capture nulls / small scale structure
    ref_ali = savgol_filter(ref_ali, savgol_window_1, polyorder, axis=0)
    tile_ali = savgol_filter(tile_ali, savgol_window_1, polyorder, axis=0)

    # Savgol level 2. Smooth noise
    ref_ali = savgol_filter(ref_ali, savgol_window_2, polyorder, axis=0)
    tile_ali = savgol_filter(tile_ali, savgol_window_2, polyorder, axis=0)

    return (ref_ali, tile_ali, time_array, ref_power, tile_power, ref_time, tile_time)

if __name__ == "__main__":

    ref = "tiles_data/rf0XX/2019-10-10/rf0XX_2019-10-10-00:00.txt"
    tile = "tiles_data/S06XX/2019-10-10/S06XX_2019-10-10-00:00.txt"

    fractions = np.linspace(0.02, 1, 50)

    med_time = []
    for f in fractions:

        # Repeat measurements 10 times and take median
        time_array = []
        for i in range(10):
            start = time()
            si_out = savgol_interp(
                ref,
                tile,
                savgol_window_1=11,
                savgol_window_2=15,
                polyorder=2,
                interp_type="cubic",
                interp_freq=1,
                fraction=f,
            )
            end = time()
            time_array.append(end - start)

        med_time.append(np.median(time_array))

    fig = plt.figure(figsize=(6, 4))
    plt.style.use("seaborn")
    plt.plot(fractions, med_time, linewidth=2.1, marker="o", color="maroon")
    plt.xlabel("Fraction of Number of lines")
    plt.ylabel("Median Time [seconds]")
    plt.tight_layout()
    plt.savefig("lines_time.png")

teuben commented 3 years ago

The depreciation warnings come from a package which I use - mwa-pb, and has required me to freeze some of the dependancies in the requirement.txt. I'm not sure why those two tests are failing. I cloned a fresh copy of EMBERS and tried the tests again and all of them seem to pass. Could it be a python version thing? I'm stumped

The automated tests on travis are all passing too. Check out Latest Test and All Past Tests

I'm stumped too, i'm going to ignore this for the review, as the examples seem to run (ok , i do have one failure still, the ephem_chrono task).

amanchokshi commented 3 years ago

Hi Peter, I'll try and summarise what I've found in my performance tests:

O(N) - Performance scales linearly with the number of files, on a single core
O(M) - Performance scales linearly with the number of lines of data, with a significant overhead reading in the data. By my estimate, more than half the time processing a single file of ~17k lines can be attributed to reading in the file. This is ~0.27 seconds per file. The processing itself is done with scipy interpolation + filtering, which scales linearly with array size

teuben commented 3 years ago

thanks Aman, this is indeed very useful stuff to understand. 50% in the I/O is quite a lot. These are 2kb files, that should read in 0.02ms on a typical 100MB/s disk. So a significant time it does something else. Maybe the decoding? Even the readlines() routine isn't able to perform at formal speed. I measured indeed about 250ms for reading one file. When not processing any line, so returning nothing, it still look 5-10 ms. I might play with this, I didn't see any obvious ways to speed this up.

On 11/18/20 8:02 PM, Aman Chokshi wrote:

Hi Peter, I'll try and summarise what I've found in my performance tests:

O(N) - Performance scales linearly with the number of files, on a single core

O(M) - Performance scales linearly with the number of lines of data, with a significant overhead reading in the data. By my estimate, more than half the time processing a single file of ~17k lines can be attributed to reading in the file. This is ~0.27 seconds per file. The processing itself is done with |scipy| interpolation + filtering, which scales linearly with array size

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openjournals/joss-reviews/issues/2629#issuecomment-730054795, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZ4MGONTDIFMWS7OYUI55LSQRVBJANCNFSM4QSAJ3TA.

teuben commented 3 years ago

thanks Aman, this is indeed very useful stuff to understand. 50% in the I/O is quite a lot. These are 2kb files, that should read in 0.02ms on a typical 100MB/s disk. So a significant time it does something else. Maybe the decoding? Even the readlines() routine isn't able to perform at formal speed. I measured indeed about 250ms for reading one file. When not processing any line, so returning nothing, it still look 5-10 ms. I might play with this, I didn't see any obvious ways to speed this up.

oops, i messed up my units. The files are 2MB, so that should be around 10ms, which corresponds to my just processing the lines, no decoding. So that decoding is expensive, more like 200 ms, 20 times as expensive as reading the line.

teuben commented 3 years ago

To close up the review, and assuming you can add a few words on the performance, the only puzzle remaining is why I could not get ephem_chrono to work. It seems a few other tools after that also fail for me, possibly because of this? I also had a case where leaving out the --max_cores=4 would cause it to fail with the error "ValueError: max_workers must be greater than 0"

amanchokshi commented 3 years ago

I think I know where the error in --max_cores comes from. I'll test it out and ping you. The ephem_chrono problem is a puzzle, but I'll try it again once I've finished the max_core issue

amanchokshi commented 3 years ago

The batch scripts should now work after you update EMBERS to the latest version - 0.8.2. I had added a bad default value to the max_cores argparse option.

amanchokshi commented 3 years ago

I ran ephem_chrono again and things worked. Do you have a folder called TLE in the sat_utils directory? This should have been copied over when you cloned the sample data and moved the sample data to the correct dirs.

Before running ephem_batch:

tree -d sat_utils
sat_utils
└── TLE

After running ephem_batch:

tree -d sat_utils
sat_utils
├── TLE
├── ephem_data
└── ephem_plots

The above data is required for ephem_chrono to work. Finally, after ephem_chrono runs successfully:

tree -d sat_utils
sat_utils
├── TLE
├── ephem_chrono
├── ephem_data
└── ephem_plots

Can you delete all the data in the sat_utils directory, except the TLE folder, then run ephem_batch followed by ephem_chrono?

openjournals / joss-reviews