Closed: vepadulano closed this pull request 3 years ago
Followup to the discussion on Mattermost:
I've benchmarked four alternative ways of creating the clustered ranges, including computing the clusters in `get_clusters` and checking them in `get_clustered_ranges` without modifying the algorithm. All the code can be found at https://cernbox.cern.ch/index.php/s/eebJBNaNpRbwupO
I've run 1000 executions of each Python script on a 4-core machine with CVMFS, using LCG_98python3spark3. I get the following boxplot:
So it seems the differences in performance are truly minimal. I am personally in favor of this PR for ease of reading; I also wouldn't mind changing it to use a single list comprehension (it would make `get_clustered_ranges` much shorter), i.e.:
```python
def get_clustered_ranges_onelistcomp(treename, filelist, npartitions,
                                     friend_info=None):
    filesandclusters = get_clusters(treename, filelist)
    clusters_split_by_partition = list(
        _n_even_chunks(filesandclusters, npartitions))
    clustered_ranges = [
        Range(
            min(clusters)[0] - clusters[0].offset,
            max(clusters)[1] - clusters[0].offset,
            [
                filetuple.filename
                for filetuple in sorted(set(  # set to take unique file indexes
                    cluster.filetuple for cluster in clusters
                ))
            ],
            friend_info
        )
        for clusters in clusters_split_by_partition
    ]
    return clustered_ranges
```
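The `_n_even_chunks` helper used above is not shown in this thread; a hypothetical sketch of its behaviour (yielding `nchunks` contiguous, nearly even slices of a sequence) could look like this. The real PyRDF implementation may differ.

```python
def _n_even_chunks(seq, nchunks):
    """Yield nchunks contiguous, nearly even slices of seq.

    Hypothetical stand-in for the PyRDF helper of the same name;
    it only illustrates how clusters are split across partitions.
    """
    last = 0
    for i in range(1, nchunks + 1):
        cur = (len(seq) * i) // nchunks
        yield seq[last:cur]
        last = cur

# Seven clusters split across three partitions
chunks = list(_n_even_chunks(list(range(7)), 3))
print(chunks)  # [[0, 1], [2, 3], [4, 5, 6]]
```

Note that every element stays in exactly one chunk and the original order is preserved, which is what the clustered-range construction relies on.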
The failures in the CI are related to Python 2 not having the `defaults` parameter in `namedtuple`.
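For context, `collections.namedtuple` only gained the `defaults` keyword in Python 3.7. A sketch of a backwards-compatible workaround (the `Range` field names here follow the PR description; the exact fields in PyRDF may differ):

```python
from collections import namedtuple

# Python >= 3.7 only:
#   Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"],
#                      defaults=[None])

# Backwards-compatible alternative (works on Python 2.7 and 3.x):
# assign defaults for the rightmost fields via __new__.__defaults__
Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"])
Range.__new__.__defaults__ = (None,)  # friend_info defaults to None

r = Range(0, 30, ["f1.root"])
print(r.friend_info)  # None
```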
Other question: Can you run PyRDF with a local spark cluster? Like you can use PySpark locally?
Yes, definitely! It will invoke the local PySpark transparently (that's how our CI works as well).
Ah, I got confused by the naming "local backend" for usage of plain ROOT :D Alright, thanks!
The last commit polishes the logic to work in a single list comprehension (thus using the generator function directly instead of creating a list first) and updates and extends the documentation. The PR description has been updated per the last commit message.
The last push takes into account comments by @etejedor and adds documentation for the reasoning behind subtracting the offset of the first file in each range from the `start` and `end` entries.
A clustered range is the representation of the entries that span multiple clusters of one or more files. For example, the (inclusive) clusters (0,10), (11,20), (21,30) could belong to the same partition [0,30]. Furthermore, different clusters may belong to different files, so that in the end a single partition could involve multiple files, e.g.
Previously, if different clusters belonging to the same partition were inside the same file, the file would be counted only once. Although this may seem right, there are cases where it leads to reading fewer entries than necessary, notably when the same file is given multiple times as input to RDataFrame to augment the dataset size.
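The cluster-to-partition example above can be sketched in a few lines (the cluster boundaries are the ones from the text; this is only an illustration, not the PyRDF data model):

```python
# Illustrative cluster boundaries (inclusive), as in the example above
clusters = [(0, 10), (11, 20), (21, 30)]

# A partition spanning all three clusters covers
# [start of the first cluster, end of the last cluster]
partition = (min(clusters)[0], max(clusters)[1])
print(partition)  # (0, 30)
```

This is the same `min(...)[0]` / `max(...)[1]` pattern that appears in the list comprehension earlier in the thread.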
The new algorithm takes a more robust approach to the division of clusters along partitions and at the same time avoids said issue. On the one hand, the clusters computed in the `get_clusters` function are now a namedtuple of the form (start, end, offset, (filename, fileindex)); in this way each cluster can also be uniquely linked to a specific file (even when multiple files have the same name). On the other hand, the `_get_clustered_ranges` function has been completely refactored into a single list comprehension that transforms a list of clusters into a list of `Range` namedtuple objects of the form (start, end, [file_1,...,file_n], friend_info).
Some extra features included: the `nentries` parameter of the `_get_clustered_ranges` function has been removed because it was unnecessary.
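A minimal sketch of how such a cluster namedtuple maps to a `Range`, including the offset subtraction discussed above. The `Cluster` and `FileTuple` definitions mirror the (start, end, offset, (filename, fileindex)) form described in this comment, but the concrete values are made up for illustration:

```python
from collections import namedtuple

Cluster = namedtuple("Cluster", ["start", "end", "offset", "filetuple"])
FileTuple = namedtuple("FileTuple", ["filename", "fileindex"])
Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"])

# Two clusters of a second file in the chain: its entries start at a
# global offset of 100, so global boundaries are 100..200.
# Values are illustrative, not taken from a real dataset.
clusters = [
    Cluster(100, 150, 100, FileTuple("b.root", 1)),
    Cluster(150, 200, 100, FileTuple("b.root", 1)),
]

# Subtracting the offset of the first file in the range turns the global
# entry numbers into file-local ones, which is what a reader opening
# only this range's files needs.
r = Range(
    min(clusters)[0] - clusters[0].offset,
    max(clusters)[1] - clusters[0].offset,
    sorted({c.filetuple.filename for c in clusters}),
    None,
)
print(r)  # Range(start=0, end=100, filelist=['b.root'], friend_info=None)
```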