Master List of data.table Issues for GSoC '24 (Josh)

joshhwuu commented 5 months ago

Blog:

https://joshhwuu.github.io/

Issues:

[x] #6141
[ ] #5611
[x] #4549
[x] #2492
[ ] #5558
[x] #2964
[x] #981
[x] #5411
[x] #5409
[ ] #6286
[x] #3060

A list of pull requests can be found here: https://github.com/Rdatatable/data.table/pulls/joshhwuu

joshhwuu commented 5 months ago

@tdhock @Anirban166 Hey mentors, we should use this thread as a way to discuss things that don't directly have to do with the other members of data.table, feel free to comments if you have any questions or concerns.

joshhwuu commented 5 months ago

Going to start updating this thread to keep mentors updated with progress. @tdhock @Anirban166

May 28, 2024 (Week 1)

I submitted/merged PR #6150 Fix Windows Parsing Issue which aims to fix an encoding issue related with the base locale on certain Windows builds, as pointed out by Michael and can be found here: https://github.com/Rdatatable/data.table/actions/runs/9140204969/job/25133295034 For this issue I tried to recreate the problem by setting locales in my Linux system, was able to reproduce with LC_CTYPE being the latin-1/Windows 1252 equivalent for Linux, and then set the tests in a local to use a utf-8 locale, which should work for modern systems, although I'm not sure if it'll work on all Windows builds, even those that might not have utf-8 available as a locale.

PR #6151 Update fifelse documentation has also been submitted, awaiting review, this PR updates fifelse documentation to warn users not to use it in cases such as recursive functions because of its evaluation pattern, and hints to fcase, need advice on whether to update fcase documentation as well.

Draft #6158 fread blank.lines.skip gains new value has also been submitted, current LF help on a question I'm having with the C code, if anyone has any suggestions that would be great.

tdhock commented 5 months ago

looks great, thanks for sharing. please also consider making a blog, posting weekly

joshhwuu commented 5 months ago

Yes, I have set one up but forgot to link it, I'll link it here and at the top of the page here, I'll look to post an update on the blog every saturday/sunday: https://joshhwuu.github.io/

Anirban166 commented 5 months ago

Looks great to me as well! Keep up the good work and don't forget to include links to your blog posts as and when you get done writing/publishing them (along with reproducible examples and everything else we have been discussing during our weekly calls apart from sharing links to your blog)

joshhwuu commented 5 months ago

Hey mentors! As promised, this is the newest entry on my blog linked here

tdhock commented 5 months ago

Hi Josh your blog looks great thanks also please add links on https://github.com/rdatatable/data.table/wiki/Articles and you may consider making an RSS feed and submitting your blog to https://www.r-bloggers.com/add-your-blog/

joshhwuu commented 5 months ago

@tdhock @Anirban166

June 3, 2024 (Week 2)

This week, I am planning to work on #5611, #5558, #5411.

#5611 is a WIP, I submitted a draft PR. I will do some digging to see what exactly the two assertions are testing for, as they are nocov. If I can figure that out, then we can confirm how well the change works.

I just submitted a PR for #5558, awaiting review here: #6167. I thought the changes were quite straightforward but if anyone sees any potential issues, LMK.

#5411, like some of the old issues should be a documentation change, so I expect most of the time to be on reviewing, as documentation needs to be perfect.

Currently, I have PR #6165 open as well, just waiting on a review from Jan, so we'll see how that goes.

Of course, if I manage to finish most of my tasks, I'll be adding in more work from my proposal/data.table issues. Cheers!

joshhwuu commented 4 months ago

@tdhock @Anirban166

June 10, 2024 (Week 3)

Now that we're coming down to the last week of phase 1 of my proposed project, I have a few more issues to submit PRs for: #981 and #5411. I also recently added #5409 to work on as well, so we'll see how well the prior two issues go before I start on that one.

I have a few open PRs still, needing some reviews but those can be worked on as the summer progresses as well, just hopefully we'll merge some in the coming days so there's less merge conflicts with new PRs.

joshhwuu commented 4 months ago

Update on blog! @tdhock @Anirban166

https://joshhwuu.github.io/2024-06-16-GSoC-Wk3/

joshhwuu commented 4 months ago

@tdhock @Anirban166

June 17, 2024 (Week 4)

Today I opened a new PR for #4280, and will be looking to wrap that up with pending reviews. Additionally, I'm going to be looking at merging open PRs, most notably #6167 and #6165, whenever Michael and Jan get a chance to take a look. As for other open PRs, it seems that everyone agrees #6175 is unnecessary so I'll potentially be adding some documentation to point users in the right direction if they ever need something similar. #6158 has a big issue currently, which is that fread simply shouldn't (IMO) be able to skip beginning empty lines, as it is important that the first line read determines how columns are made. If we skip the beginning lines, the resulting data.table is always single-columned. Will follow up with Ben sometime to see if he agrees, and if my assumption is correct we can probably close that issue as well.

Something bigger I'll be working on:

Although this wasn't in my initial proposal, Issue #5409 is a very good example of why assign.c's internal SEXP assign should be refactored, as Ben mentioned here. IMO, the behavior of set and := should be the same, but currently both of them call very different parts of SEXP assign, causing slight differences such as the one outlined in Issue #5409. It would help to refactor this file, but seeing that it's a very long file I'll be working on this slowly as I proceed with other issues as well. Although I think assign is very well tested so refactoring should be pretty safe.

joshhwuu commented 4 months ago

Hey @Anirban166 @tdhock

Just a slight heads up that I'll be taking a final exam for my summer course on the next coming Monday, so in case there's not too much progress on current tasks that is why. However I'm feeling quite prepared for the exam so I don't think I'll be spending a whole lot of time on studying. Just a note that this is my only course this summer as well so I'll be done with everything after Monday, thanks!

Anirban166 commented 4 months ago

Hey @Anirban166 @tdhock

Just a slight heads up that I'll be taking a final exam for my summer course on the next coming Monday, so in case there's not too much progress on current tasks that is why. However I'm feeling quite prepared for the exam so I don't think I'll be spending a whole lot of time on studying. Just a note that this is my only course this summer as well so I'll be done with everything after Monday, thanks!

All the best!

joshhwuu commented 4 months ago

@tdhock @Anirban166

June 25 (Week 5)

This week I'm going to be working on prototypes for adding a progress bar for large by operations, I'm expecting this to be on the heavier end. Jan said that we should test revdeps for PR 6165, but I noticed in .dev/revdep.R that this script is meant to be run by the package maintainer, and there aren't instructions to run the script so I am looking for advice on how to check those for this change. I added a new PR 6204, pending review. As of right now, PR #6175 doesn't seem like a good idea anymore, so I'm leaning towards closing and following up in the original issue to see how we should proceed. Hoping to make good progress on the bigger issues this week!

tdhock commented 4 months ago

Hi Josh, since last year we are no longer using .dev/revdep.R and instead using daily checks of master on the NAU Monsoon super-computer, https://github.com/Rdatatable/data.table/wiki/Release-management-and-revdep-checks which takes ~10 hours walltime instead of about ~20 days if you were to run .dev/revdep.R on your laptop, sequentially checking all 1505+ revdeps. so we will merge your PR into master first, and then we will see the revdep results the next day on a page like this https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/2024-06-25/

joshhwuu commented 4 months ago

@tdhock @Anirban166

July 1 (Week 6)

Happy Canada Day! 🍁🎉 This week I'm going to be working on some of the larger issues I mentioned before - these have turned out to be quite difficult so I'm going to be spending a while on them. Other than that, I'm going to be browsing other open issues that I can potentially take on whenever I get the chance.

Also a quick note: I'll be visiting SF with a couple friends during the weekend, and as per GSoC rules, I shouldn't contribute code while in a different country, so I'll be unavailable during the weekend in case of review suggestions. I'll be back in Canada by next Monday however.

tdhock commented 4 months ago

great thanks please submit comments and partial PRs so you can get our feedback, even before the larger issues are solved

joshhwuu commented 4 months ago

great thanks please submit comments and partial PRs so you can get our feedback, even before the larger issues are solved

Sure! TBH I am spending a lot of time just trying to understand the code.. So not much to show for other than some print statements to find points in code when executing "by" operations. AFAIU so far, by operations are either handled by gforce or by dogroups. This means that there's no easy way to handle the creation of a progress indicator from data.table.R as the real-time updating of groups is not within scope. This means I will likely have to make a progress indicator in the C level within dogroups.c, and for the gforce optimized functions as well. I'm wondering if it is necessary however to make one for gforce, as these tend to be very fast. @tdhock @Anirban166 LMK your thoughts on this..

joshhwuu commented 4 months ago

Okay. I have gotten a prototype that in dogroups.c that kind of does what we want it to do. Something like: Processed group 1000000 out of 1000000. Elapsed time: 9.32 seconds. Estimated time remaining: 0.00 seconds. that updates for every iteration within the main SEXP dogroups function. Now, I am trying to figure out why my elapsed/total time doesn't add up correctly for large operations (something to do with C's clock() being the CPU time when I want system time. I'll be working on reading up on this from a few resources, such as https://people.cs.rutgers.edu/~pxk/416/notes/c-tutorials/times.html

tdhock commented 4 months ago

look at fread/fwrite to see other examples of progress reporting

joshhwuu commented 4 months ago

Hm.. I have a few questions about the other progress reporting. It seems that fread uses a custom reporting function void progress and a (gettime-ish?) wallclock(). It seems that fwrite imports wallclock() and uses it to get timings instead of clock(). Does wallclock() have lower overhead than clock()? (I couldn't find any documentation on wallclock() ) I noticed that clock() in C has pretty high overhead and slows down the processing of the main dogroups loop by around 30-40% over large (>1000000) iterations (this also increases when we start calculating average time, estimated time remaining, and especially calling Rprintf()). I was confused as dogroups.c uses clock() a few times, so I'm not sure what to use best in this situation. I'll benchmark with clock() and wallclock() a few times to see if there are any tangible differences in running time, assuming that both return something similar and I'm understanding the purpose of wallclock.

Also, just a note: I think fwrite could also use void progress in its reporting, seems that there's a TODO in place to make the change as well

https://github.com/Rdatatable/data.table/blob/b7c5eaa3eb8d02c044dfa7b6af83a304a7385817/src/fwrite.c#L952

joshhwuu commented 4 months ago

Timings here:

Not calling clock() within dogroups main loop (current behavior)

> library(microbenchmark)
> dt = data.table(a = 1:1000000)
> microbenchmark(dt[, 1, by = a], times = 20)
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
 dt[, 1, by = a] 382.7451 388.2441 455.9506 393.6899 406.1546 1595.355    20

Calling clock() once

Unit: milliseconds
            expr      min       lq    mean  median       uq      max neval
 dt[, 1, by = a] 516.2963 519.4566 523.264 522.445 525.3977 540.2256    20

Calling wallclock() once

Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval
 dt[, 1, by = a] 434.4907 438.4205 445.4385 441.9558 446.7205 479.5391    20

tdhock commented 4 months ago

I don't know much about internals of fread/write (progress function, wallclock vs clock etc) but you should create an issue/pr and ask Micheal/Jan/others for the timings of different versions of C code, you can use atime to compare them, if you have the different versions in different git commits, see https://github.com/tdhock/atime/blob/main/vignettes/data.table.Rmd#L107

joshhwuu commented 3 months ago

@Anirban166 @tdhock

July 15 (Week 8)

Hey mentors, I apologize for the slow progress last week, I caught a nasty flu of some sort and had to take a few days to properly rest (Good thing I allotted a lot of time for these larger issues, we are still on track to finish 😆). Now that I'm better, I'm happy to say that I've been making good progress on the progress indicator for by operations, #6228. Alongside this, #6204 looks good to go as well. #6165 was merged so I'll have to take a look at revdeps this week. Still waiting on Michael's review of #6167, but with the new release on the horizon and some R CMD check occasional issues, I imagine this quite hefty PR is going to take a while before going to merge.

joshhwuu commented 3 months ago

@tdhock @Anirban166

July 23 (Week 9)

This week I'm looking forward to apply reviews and merge some of the open PRs that I have. Other than that, I'm actively seeking some larger issues to work on within the issue tracker, notably either from the 1.16.0 milestone or the most-requested issues list. As per my proposal, I'm looking to completing two more larger issues and I will have completed my goal. Obviously this doesn't mean I have to stop there, as I realize now that I've given myself quite a generous amount of time, so I'm going to do my best to do as much as I can!

joshhwuu commented 2 months ago

@tdhock @Anirban166

Hi mentors! For the past few weeks, I've been looking for more issues to close, and I've been able to write out some prototypes/proof-of-concepts for quite a few of them. As we are coming close to the release date for 1.16.0, I am a little hesitant to open new PRs, as it seems that it will be a while before they can get reviews and go towards merge. I find that I am more productive when my PRs are reviewed faster, as with GSoC I often work on several PRs at once, and when reviews are (understandably) slower, I tend to sometimes forget things about the changes, causing me to miss things.

Recently, I've been engaging in discussion and investigation regarding changes to test.data.table and its double printing problem and lintr's inefficiency with .Rraw files. I hope to open PRs for these as I have local branches with changes ready. These will likely come out after 1.16.0 release on CRAN (even if that comes past the GSoC deadline). I have also been looking into refactoring some calls that can be simplified via helpers within our R code (centralizing handling of index retrieval, etc). On top of this, I've been also looking at some of the mutation testing results and will help to investigate.

My plan for the last two weeks of GSoC is to submit my remaining changes as PRs after 1.16.0 is live on CRAN and more of the team is available, try to actively participate in discussions and follow up with some old issues that can potentially be closed.

Overall, I've been very happy with the contributions I made, and would like to think that my work has been impactful (awesome to see my name on NEWS.md many times). If any of you have feedback/suggestion for me, please let me know.

Question:

At the end of GSoC, I'd like to formally thank the data.table team and anyone else who has helped out by commenting and reviewing. What would you recommend be the best medium to use to do this? Maybe some sort of email group or potentially a community blog post? It would be a short message though, so I'm not sure what would be best, let me know if you have any ideas. Thanks!

tdhock commented 2 months ago

best way to contact the team is via issues, which are typically not used for "thanks" but would be great to see you keep helping on issues/PRs after GSOC is done. If you write a blog, please post a link on https://github.com/Rdatatable/data.table/wiki/Articles

joshhwuu / gsoc-2024