Closed: vepadulano closed this pull request 3 years ago
Followup to the discussion on Mattermost:
I've benchmarked four alternative ways of creating the clustered ranges, including computing the clusters in `get_clusters` and checking them in `get_clustered_ranges` without modifying the algorithm. All the code can be found at https://cernbox.cern.ch/index.php/s/eebJBNaNpRbwupO
I've run 1000 executions of each Python script on a 4-core machine with CVMFS, using LCG_98python3spark3. I get the following boxplot:
So it seems the differences in performance are truly minimal. I am personally in favor of this PR for ease of reading; I also wouldn't mind changing it to use a single list comprehension (it would make `get_clustered_ranges` much shorter), i.e.:
```python
def get_clustered_ranges_onelistcomp(treename, filelist, npartitions,
                                     friend_info=None):
    filesandclusters = get_clusters(treename, filelist)
    clusters_split_by_partition = list(
        _n_even_chunks(filesandclusters, npartitions))
    clustered_ranges = [
        Range(
            min(clusters)[0] - clusters[0].offset,
            max(clusters)[1] - clusters[0].offset,
            [
                filetuple.filename
                for filetuple in sorted(set(  # set to take unique file indexes
                    cluster.filetuple for cluster in clusters
                ))
            ],
            friend_info
        )
        for clusters in clusters_split_by_partition
    ]
    return clustered_ranges
```
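The `_n_even_chunks` helper used above is not shown in this thread; a hypothetical sketch of its behaviour (yielding `nchunks` contiguous, nearly even slices of a sequence) could look like this. The real PyRDF implementation may differ.

```python
def _n_even_chunks(seq, nchunks):
    """Yield nchunks contiguous, nearly even slices of seq.

    Hypothetical stand-in for the PyRDF helper of the same name;
    it only illustrates how clusters are split across partitions.
    """
    last = 0
    for i in range(1, nchunks + 1):
        cur = (len(seq) * i) // nchunks
        yield seq[last:cur]
        last = cur

# Seven clusters split across three partitions
chunks = list(_n_even_chunks(list(range(7)), 3))
print(chunks)  # [[0, 1], [2, 3], [4, 5, 6]]
```

Note that every element stays in exactly one chunk and the original order is preserved, which is what the clustered-range construction relies on.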
The failures in the CI are related to Python 2 not having the `defaults` parameter in `namedtuple`.
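For context, `collections.namedtuple` only gained the `defaults` keyword in Python 3.7. A sketch of a backwards-compatible workaround (the `Range` field names here follow the PR description; the exact fields in PyRDF may differ):

```python
from collections import namedtuple

# Python >= 3.7 only:
#   Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"],
#                      defaults=[None])

# Backwards-compatible alternative (works on Python 2.7 and 3.x):
# assign defaults for the rightmost fields via __new__.__defaults__
Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"])
Range.__new__.__defaults__ = (None,)  # friend_info defaults to None

r = Range(0, 30, ["f1.root"])
print(r.friend_info)  # None
```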
Other question: Can you run PyRDF with a local spark cluster? Like you can use PySpark locally?
Yes, definitely! It will invoke the local PySpark transparently (that's how our CI works as well).
Ah, I got confused by the naming "local backend" for usage of plain ROOT :D Alright, thanks!
The last commit polishes the logic to work in a single list comprehension (thus using the generator function directly instead of creating a list first) and updates and extends the documentation. The PR description has been updated per the last commit message.
The last push takes into account comments by @etejedor and adds documentation for the reasoning behind subtracting the offset of the first file in each range from the `start` and `end` entries.
A clustered range is the representation of the entries that span multiple clusters of one or more files. For example, the (inclusive) clusters (0,10), (11,20), (21,30) could belong to the same partition [0,30]. Furthermore, different clusters may belong to different files, so that in the end a single partition could involve multiple files, e.g.
Previously, if different clusters belonging to the same partition were inside the same file, the file would be counted only once. Although this may seem right, there are cases where it leads to reading fewer entries than necessary, notably when the same file is given multiple times as input to RDataFrame to augment the dataset size.
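The cluster-to-partition example above can be sketched in a few lines (the cluster boundaries are the ones from the text; this is only an illustration, not the PyRDF data model):

```python
# Illustrative cluster boundaries (inclusive), as in the example above
clusters = [(0, 10), (11, 20), (21, 30)]

# A partition spanning all three clusters covers
# [start of the first cluster, end of the last cluster]
partition = (min(clusters)[0], max(clusters)[1])
print(partition)  # (0, 30)
```

This is the same `min(...)[0]` / `max(...)[1]` pattern that appears in the list comprehension earlier in the thread.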
The new algorithm takes a more robust approach to the division of clusters along partitions and at the same time avoids said issue. On the one hand, the clusters computed in the `get_clusters` function are now a namedtuple of the form (start, end, offset, (filename, fileindex)); in this way each cluster can also be uniquely linked to a specific file (even when multiple files have the same name). On the other hand, the `_get_clustered_ranges` function has been completely refactored into a single list comprehension that transforms a list of clusters into a list of `Range` namedtuple objects of the form (start, end, [file_1,...,file_n], friend_info).
Some extra features included: the `nentries` parameter of the `_get_clustered_ranges` function has been removed because it was unnecessary.
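A minimal sketch of how such a cluster namedtuple maps to a `Range`, including the offset subtraction discussed above. The `Cluster` and `FileTuple` definitions mirror the (start, end, offset, (filename, fileindex)) form described in this comment, but the concrete values are made up for illustration:

```python
from collections import namedtuple

Cluster = namedtuple("Cluster", ["start", "end", "offset", "filetuple"])
FileTuple = namedtuple("FileTuple", ["filename", "fileindex"])
Range = namedtuple("Range", ["start", "end", "filelist", "friend_info"])

# Two clusters of a second file in the chain: its entries start at a
# global offset of 100, so global boundaries are 100..200.
# Values are illustrative, not taken from a real dataset.
clusters = [
    Cluster(100, 150, 100, FileTuple("b.root", 1)),
    Cluster(150, 200, 100, FileTuple("b.root", 1)),
]

# Subtracting the offset of the first file in the range turns the global
# entry numbers into file-local ones, which is what a reader opening
# only this range's files needs.
r = Range(
    min(clusters)[0] - clusters[0].offset,
    max(clusters)[1] - clusters[0].offset,
    sorted({c.filetuple.filename for c in clusters}),
    None,
)
print(r)  # Range(start=0, end=100, filelist=['b.root'], friend_info=None)
```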