RING-44425 - Comments for SPARK scripts

TrevorBenson commented 11 months ago

There are a few spots with ???? which either the entire operation I wanted to confirm before adding a comment, or there is a description, but I am not 100% confident in its accuracy and need to compare some more current output from the scripts to be sure.

TrevorBenson commented 11 months ago

stashing my review here — to be continued. I'm considering going all the way down to the very variables that are used, to have clear view on what data frames we create and filter out, etc.

I've applied most. Questions posed to outstanding suggestions.

TrevorBenson commented 11 months ago

New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including if columns are used or not used in later steps.

Once we get this approved we can use a new ticket to suggest minimizing the data written to fields actually used in later scripts. :crossed_fingers: with our comments detailing how it all works we can get approval to change the actual scripts.

scality-fno commented 11 months ago

New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including if columns are used or not used in later steps.

Once we get this approved we can use a new ticket to suggest minimizing the data written to fields actually used in later scripts. 🤞 with our comments detailing how it all works we can get approval to change the actual scripts.

👍 I'll make the time for P1, then

TrevorBenson commented 11 months ago

Take a look at p1-p4 now. They all have input structures, required fields, etc.

The WT Heck is inside S3_FSCK files google sheet had a header added for at least one tab. This may cause a comment to incorrectly state it is reading with header describing input which may not actually have a header, needing to be changed to _c0, _c1 etc.

Trevor Benson Scality - Staff Engineer +1 (707) 479 2965 (m) | @.***

@scality https://twitter.com/scality

@trevorbenson https://www.linkedin.com/in/trevorbenson/

scality.com https://www.scality.com

On Tue, Oct 10, 2023 at 8:59 AM scality-fno @.***> wrote:

New pass on P0. TODO: evaluate the need for each column to exist.

Let's keep the focus on comments, including if columns are used or not used in later steps.

Once we get this approved we can use a new ticket to suggest minimizing the data written to fields actually used in later scripts. 🤞 with our comments detailing how it all works we can get approval to change the actual scripts.

👍 I'll make the time for P1, then

— Reply to this email directly, view it on GitHub https://github.com/scality/spark/pull/48#issuecomment-1755748358, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACF6ID4Q57E4MLIUVPANLTLX6VWFZAVCNFSM6AAAAAA5LAI4A6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONJVG42DQMZVHA . You are receiving this because you authored the thread.Message ID: @.***>

TrevorBenson commented 10 months ago

"P1" review: thanks to @fra-scality and his jupyter notebooks, figured out how the dataframes were used. The more I read it, the less I understand. Either I'm missing sth obvious or the naming is terrible.

It might be useful to contribute those to the repository so we can improve upon them as needed. We might need to repeat this process for SOFS_FSCK.

scality / spark

RING-44425 - Comments for SPARK scripts #48