suever / pydicom-experimental


read_file(f, stop_before_pixels=True) is slower than without #94

Open suever opened 9 years ago

suever commented 9 years ago

From njv...@gmail.com on December 03, 2010 11:13:32

On my DICOMs (2D slices from a GE MRI scanner), I've found that telling pydicom to skip the pixel data is actually slower than reading it. This is true both when repeatedly re-reading one file with timeit and when running time_test.py.

  1. Point time_test.py at a directory full of dicoms
  2. Enable only test_full_read() and test_partial(). Don't set dataset=.
  3. Run a bunch of times, compare times.

test_full_read(): 601937 function calls (601137 primitive calls) in 1.750 CPU seconds

    ncalls       tottime  percall  cumtime  percall  filename:lineno(function)
    102100         0.906    0.000    0.906    0.000  {method 'read' of 'file' objects}
    34200/33600    0.211    0.000    0.656    0.000  filereader.py:150(data_element_generator)
    35100          0.141    0.000    0.171    0.000  tag.py:8(Tag)
    400/200        0.120    0.000    1.032    0.005  filereader.py:259(read_dataset)
    33800          0.086    0.000    0.086    0.000  {built-in method __new__ of type object at 0x100126080}

test_partial(): 829791 function calls (828991 primitive calls) in 2.355 CPU seconds

    ncalls       tottime  percall  cumtime  percall  filename:lineno(function)
    101900         1.048    0.000    1.048    0.000  {method 'read' of 'file' objects}
    67800          0.275    0.000    0.327    0.000  tag.py:8(Tag)
    34100/33500    0.228    0.000    0.958    0.000  filereader.py:150(data_element_generator)
    1              0.162    0.162    2.348    2.348  time_test.py:69(test_partial)
    67200          0.134    0.000    0.502    0.000  tag.py:56(__eq__)
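For reference, profiles like the ones above can be produced with the standard library's cProfile. A minimal sketch, assuming the pre-1.0 package name `dicom` (current releases are imported as `pydicom`) and the same sample file used below:

```python
# Minimal illustrative sketch, not part of the original report. Assumes the
# pre-1.0 package name `dicom` and the sample file name 's11_epi.0001'.
import cProfile
import dicom

cProfile.run("dicom.read_file('s11_epi.0001', stop_before_pixels=True)",
             sort='tottime')
```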

Or, in a python shell:

  1. r_all = "d = dicom.read_file('s11_epi.0001')"
  2. r_nopix = "d = dicom.read_file('s11_epi.0001', stop_before_pixels=True)"
  3. timeit.timeit(r_all, "import dicom", number=500) # 2.7346301078796387
  4. timeit.timeit(r_nopix, "import dicom", number=500) # 3.6994130611419678

I wouldn't be surprised if skipping the pixel data didn't save much time on these files, but for it to be significantly slower? That's odd.

I'm using the version easy_install grabbed for me on December 2, 2010.

I can't post any dicoms to Google, but I could send you a few.

Cheers, -Nate

Original issue: http://code.google.com/p/pydicom/issues/detail?id=94

suever commented 9 years ago

From darcymason@gmail.com on December 03, 2010 17:24:27

I'm guessing this is because the check for stopping is done by a call-back function. Function calls are costly in Python, so doing one in the innermost loop of reading the file could add a lot of time.

It might be reasonable to special-case the check for the pixel data element directly in the inner loop and avoid the call-back. That might require some change to the calling signature of read_file(), or of some of the intermediate methods it calls.
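To make the call-back overhead concrete, here is a small illustrative benchmark (not pydicom code -- the stop_before_pixels helper, the tag values, and the element count are all made up) comparing a per-element call-back check with the same comparison written inline:

```python
# Illustrative only: this is not pydicom's reading loop.
import timeit

PIXEL_DATA_TAG = 0x7FE00010

def stop_before_pixels(tag):
    return tag == PIXEL_DATA_TAG

tags = list(range(100000))  # stand-in for the data element tags of one file

def with_callback():
    for tag in tags:
        if stop_before_pixels(tag):  # one extra function call per element
            break

def inline_check():
    for tag in tags:
        if tag == PIXEL_DATA_TAG:    # same check, no call
            break

print(timeit.timeit(with_callback, number=100))
print(timeit.timeit(inline_check, number=100))
```

On CPython the call-back version is typically measurably slower, which is consistent with the extra Tag() and __eq__ calls visible in the test_partial profile above.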

Status: Accepted
Labels: -Type-Defect -Difficulty-Medium Type-Enhancement Difficulty-Easy

suever commented 9 years ago

From njv...@gmail.com on December 06, 2010 08:45:09

I'm curious: what other use case is there for stopping?

suever commented 9 years ago

From darcymason@gmail.com on December 06, 2010 19:40:29

I'm curious: what other use case is there for stopping?

Well, if it were indeed faster, then it could make sense for any kind of filtering operation on a large set of files -- read just enough to get patient name/id to group ones for the same patient, or similarly by exam id for a certain patient, or a certain referring physician. A small list of key information (patient name, id, study comments) could be offered in a GUI for a user to choose which file(s) to view.
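As an illustration of that filtering use case, here is a minimal sketch (the directory path is a placeholder, and it assumes the pre-1.0 package name `dicom`; with current pydicom, use pydicom.dcmread instead of dicom.read_file) that reads only the header of each file and groups paths by PatientID:

```python
# Minimal sketch of the filtering use case described above; not from the
# original thread. The directory path is hypothetical.
import glob
from collections import defaultdict

import dicom

files_by_patient = defaultdict(list)
for path in glob.glob('/path/to/study/*.dcm'):
    ds = dicom.read_file(path, stop_before_pixels=True)  # header only
    files_by_patient[ds.PatientID].append(path)

for patient_id, paths in files_by_patient.items():
    print(patient_id, len(paths))
```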

suever commented 9 years ago

From njv...@gmail.com on December 07, 2010 09:05:52

Makes sense.

What if you had a couple of lists, say, stop_before_tags and stop_after_all_tags -- as you read tags, you check for the tag in those two lists.

If the tag is in stop_before_tags, reading stops before that element's value is read. If it's in stop_after_all_tags, the tag is removed from the list, the value is read, and if the list is then empty, reading stops.

That way, you could do

stop_before_tags = [0x7FE00010] # pixel data

or

stop_after_all_tags = [0x00100010, 0x00324000] # patient's name, study comments

Of course, I'm not exactly sure what should happen if both are specified.

Also, I don't know enough about the structure of dicoms to know this, but if I stop before reading pixel data, would I risk missing other tags I might be interested in? Would it make more sense to, instead, skip reading certain tags?
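A rough sketch of what the proposed checks might look like (purely illustrative -- this is not pydicom's reader; here "elements" is a pre-parsed list of (tag, value) pairs standing in for the data elements of a file, and both stop lists are assumed to already hold numeric tags):

```python
# Purely illustrative sketch of the stop_before_tags / stop_after_all_tags idea.
def read_dataset(elements, stop_before_tags=None, stop_after_all_tags=None):
    remaining = set(stop_after_all_tags) if stop_after_all_tags else None
    dataset = {}
    for tag, value in elements:
        if stop_before_tags is not None and tag in stop_before_tags:
            break  # stop before reading this element
        dataset[tag] = value
        if remaining is not None:
            remaining.discard(tag)
            if not remaining:
                break  # every requested tag has been read
    return dataset

# e.g. stop before pixel data, as in the first example above
elements = [(0x00100010, 'DOE^JOHN'),        # patient's name
            (0x00324000, 'routine study'),   # study comments
            (0x7FE00010, b'...pixel data...')]
print(read_dataset(elements, stop_before_tags=[0x7FE00010]))
```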

suever commented 9 years ago

From darcymason@gmail.com on December 07, 2010 16:04:26

What if you had a couple of lists, say, stop_before_tags and stop_after_all_tags -- as you read tags, you check for the tag in those two lists.

Yes, I like that idea, it is reasonably straightforward and could be done entirely inside the reading loop, so it would cost very little time.

A couple of minor refinements or clarifications of the idea:

  1. They should both be optional arguments defaulting to None; then the first (and quick) check is simply whether each is None, and only then is the real work of checking tags done.
  2. The user should also be able to pass in values as DICOM names (keywords) rather than numeric tags. Then there is a one-time check at the beginning -- if not already a tag number, the keyword is looked up and converted to the group element tag number.

Of course, I'm not exactly sure what should happen if both are specified.

Each could just be checked individually. If either reaches the "stop" condition, the reading is stopped.

...if I stop before reading pixel data, would I risk missing other tags I might be interested in? Would it make more sense to, instead, skip reading certain tags?

There is very little that can come after the pixel data (see tags > 7FE00010 in the DICOM dictionary), so it is very unlikely you would want those items. Skipping has been tried; it saves no time at all compared with reading, but it does save memory, of course. The defer_size argument handles that.
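For reference, a minimal sketch of deferred reading with defer_size (assuming the pre-1.0 package name `dicom` and the sample file name from the original report; current pydicom's dcmread accepts the same argument):

```python
# Minimal sketch, assuming the old `dicom` package name and a sample file name.
# Element values larger than defer_size bytes are skipped during the initial
# read and only pulled from disk when the attribute is first accessed.
import dicom

ds = dicom.read_file('s11_epi.0001', defer_size=1024)
print(ds.PatientID)       # small value, already in memory
pixels = ds.PixelData     # large deferred value, read from the file on access
```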

suever commented 9 years ago

From njv...@gmail.com on December 07, 2010 16:11:22

The user should also be able to pass in values as DICOM names (keywords) rather than numeric tags. Then there is a one-time check at the beginning -- if not already a tag number, the keyword is looked up and converted to the group element tag number.

Definitely.

Also, it'd probably be sane to store the tag codes, after lookup, in a set instead of a list -- no worries about dupes, and O(1) lookups.
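A small sketch of that normalization step (assuming current pydicom, where pydicom.datadict.tag_for_keyword maps a DICOM keyword to its numeric tag):

```python
# Sketch of converting user-supplied keywords or numeric tags into a set of
# tag numbers, done once before the reading loop. Assumes current pydicom.
from pydicom.datadict import tag_for_keyword

def normalize_tags(values):
    tags = set()
    for value in values:
        if isinstance(value, int):
            tags.add(value)
        else:
            tag = tag_for_keyword(value)
            if tag is None:
                raise ValueError('Unknown DICOM keyword: %r' % (value,))
            tags.add(tag)
    return tags

stop_after = normalize_tags(['PatientName', 'StudyComments'])  # (0010,0010), (0032,4000)
stop_before = normalize_tags([0x7FE00010])                     # pixel data, already numeric
```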

Thanks!