mscross / pysplit

A package for HYSPLIT air parcel trajectory analysis.
BSD 3-Clause "New" or "Revised" License
149 stars 80 forks source link

Problem loading hysplit file with multiple trajectories #75

Closed hmjbarbosa closed 3 years ago

hmjbarbosa commented 3 years ago

There seems to be a bug in the pysplit code intended for reading a hysplit file with multiple trajectories in it. I was able to trace it back and fix the issue, and I'll try to explain it below.

The function load_hysplitfile() (inside file hyfile_handler.py) decides if it is reading a file with a single or multiple trajectories in it:

def load_hysplitfile(filename)
# (... many lines...) 
    # Split hydata into individual trajectories (in case there are multiple)
    if multiple_traj:
        hydata, pathdata, datetime = _trajsplit(hydata, pathdata, timedata,
                                                century)
    else:
        print('hmjb', timedata)
        datetime = _getdatetime(century, timedata)

Function _trajsplit() will then sort the hysplit lines, so that all lines corresponding to the same trajectory are grouped:

def _trajsplit(hydata, pathdata, timedata, century):
# (... many lines...) 
    # Find number of unique trajectories within `hydata`
    unique_traj = np.unique(hydata[:, 0])

    # Sort the array row-wise by the first column
    # Timepoints from same traj now grouped together
    sorted_indices = np.argsort(hydata[:, 0], kind='mergesort')
    sorted_hydata = hydata[sorted_indices, :]
    sorted_pathdata = pathdata[sorted_indices, :]
    sorted_timedata = timedata[sorted_indices, :]

It then tries to identify (in the sorted arrays) where is the start position (i.e. first line number) for each trajectory.

    # Find first occurrence of each traj, except for the first
    # which is obviously 0
    first_occurrence = [np.nonzero(sorted_indices == u)[0][0]
                        for u in unique_traj[1:]]

The problem happens at this point.

The code takes each trajectories number (in my case, trajectories 1 to 8) and search for all the positions (in the sorted arrays) where that trajectory occurs. The problem is the use of sorted_indices, which hold the positions in the sorted array. If we print the values in sorted_indices[:] we will have numbers 0 to the number of lines in the hysplit file (in my case, 124).

The correct would be to use sorted_hydata[:,0] which is the first column (i.e. ID of trajectory) in the sorted arrays. If we print the values in sorted_hydata[:,0], they will be those in unique_traj (in my case, 1 to 8). The corrected code looks like this:

    # Find first occurrence of each traj, except for the first
    # which is obviously 0
    # BUG first_occurrence = [np.nonzero(sorted_indices == u)[0][0]
    first_occurrence = [np.nonzero(sorted_hydata[:,0] == u)[0][0]
                        for u in unique_traj[1:]]

Could any of the developers have a look and confirm this?

mscross commented 3 years ago

You're completely right, looking through sorted_indices for the first occurrence of each traj is nonsense; we need to look through the traj numbers in the first col of sorted_hydata. Want to make a PR?

hmjbarbosa commented 3 years ago

I just submitted the pull request. I tried to link it to this issue (for automatic closing it).