read_file(f, stop_before_pixels=True) is slower than without

GoogleCodeExporter commented 9 years ago

On my DICOMs, (2D slices from a GE MRI scanner) I've found that telling PyDICOM 
to skip the pixel data is actually slower than reading it. This is both true 
when repeatedly re-reading one file with timeit, or in running time_test.py

1. Point time_test.py at a directory full of dicoms
2. Enable only test_full_read() and test_partial(). Don't set dataset=.
3. Run a bunch of times, compare times.

test_full_read()
601937 function calls (601137 primitive calls) in 1.750 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   102100    0.906    0.000    0.906    0.000 {method 'read' of 'file' objects}
34200/33600    0.211    0.000    0.656    0.000 
filereader.py:150(data_element_generator)
    35100    0.141    0.000    0.171    0.000 tag.py:8(Tag)
  400/200    0.120    0.000    1.032    0.005 filereader.py:259(read_dataset)
    33800    0.086    0.000    0.086    0.000 {built-in method __new__ of type object at 0x100126080}

test_partial()
829791 function calls (828991 primitive calls) in 2.355 CPU seconds

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   101900    1.048    0.000    1.048    0.000 {method 'read' of 'file' objects}
    67800    0.275    0.000    0.327    0.000 tag.py:8(Tag)
34100/33500    0.228    0.000    0.958    0.000 
filereader.py:150(data_element_generator)
        1    0.162    0.162    2.348    2.348 time_test.py:69(test_partial)
    67200    0.134    0.000    0.502    0.000 tag.py:56(__eq__)

Or, in a python shell:

1. r_all = "d = dicom.read_file('s11_epi.0001')"
2. r_nopix = "d = dicom.read_file('s11_epi.0001', stop_before_pixels=True)"
3: timeit.timeit(r_all, "import dicom", number=500) # 2.7346301078796387
4: timeit.timeit(r_nopix, "import dicom", number=500) # 3.6994130611419678

I wouldn't be surprised to not save much time in reading these, but to have it 
be significantly slower? That's odd.

I'm using the version easy_install grabbed for me on December 2, 2010.

I can't post any dicoms to Google, but I could send you a few.

Cheers,
-Nate

Original issue reported on code.google.com by njv...@gmail.com on 3 Dec 2010 at 4:13

GoogleCodeExporter commented 9 years ago

I'm guessing this is because the check for stopping is done by a call-back 
function. Function calls are costly in python, so doing one in the innermost 
loop of reading the file could add a lot of time.

It might be reasonable to special case the check for pixel data element 
directly in the inner loop and avoid the call-back. It might require some 
change to the calling signature of read_file(), or some of the intermediate 
methods it calls.

Original comment by darcymason@gmail.com on 4 Dec 2010 at 1:24

Changed state: Accepted
Added labels: Difficulty-Easy, Type-Enhancement
Removed labels: Difficulty-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

I'm curious: what other use case is there for stopping?

Original comment by njv...@gmail.com on 6 Dec 2010 at 4:45

GoogleCodeExporter commented 9 years ago

>> I'm curious: what other use case is there for stopping?
Well, if it were indeed faster, then it could make sense for any kind of 
filtering operation on a large set of files -- read just enough to get patient 
name/id to group ones for the same patient, or similarly by exam id for a 
certain patient, or a certain referring physician. A small list of key 
information (patient name, id, study comments) could be offered in a GUI for a 
user to choose which file(s) to view.

Original comment by darcymason@gmail.com on 7 Dec 2010 at 3:40

GoogleCodeExporter commented 9 years ago

Makes sense.

What if you had a couple of lists, say, stop_before_tags and 
stop_after_all_tags -- as you read tags, you check for the tag in those two 
lists.

If the tag is in stop_before_tags, it stops before reading further. If it's in 
stop_after_all_tags, it'd remove the tag from the list, read the value, and if 
the list is now empty, stop reading further.

That way, you could do

stop_before_tags = [0x7FE00010] # pixel data

or

stop_after_all_tags = [0x00100010, 0x00324000] # patient's name, study comments

Of course, I'm not exactly sure what should happen if both are specified.

Also, I don't know enough about the structure of dicoms to know this, but if I 
stop before reading pixel data, would I risk missing other tags I might be 
interested in? Would it make more sense to, instead, *skip* reading certain 
tags?

Original comment by njv...@gmail.com on 7 Dec 2010 at 5:05

GoogleCodeExporter commented 9 years ago

> What if you had a couple of lists, say, stop_before_tags and 
stop_after_all_tags -- as you read tags, 
> you check for the tag in those two lists.

Yes, I like that idea, it is reasonably straightforward and could be done 
entirely inside the reading loop, so it would cost very little time.

A couple of minor refinements or clarifications of the idea:
1. They should both be optional arguments defaulted to None, then the first 
(and quick) check is simply whether each is None or not, and only then the real 
work of checking tags. 
2. The user should also be able to pass in values as DICOM names (keywords) 
rather than numeric tags. Then there is a one-time check at the beginning -- if 
not already a tag number, the keyword is looked up and converted to the group 
element tag number. 

>Of course, I'm not exactly sure what should happen if both are specified.

Each could just be checked individually. If either reaches the "stop" 
condition, the reading is stopped.

> ...if I stop before reading pixel data, would I risk missing other tags I 
might be interested in? 
> Would it make more sense to, instead, *skip* reading certain tags?

There is very little that can come after pixels (see tags in DICOM dictionary > 
7Fe00010). Very unlikely to want these items. 
Skipping has been tried, it saves no time at all compared with reading but does 
save memory of course. The defer_size argument handles that.

Original comment by darcymason@gmail.com on 8 Dec 2010 at 12:04

GoogleCodeExporter commented 9 years ago

> 2. The user should also be able to pass in values as DICOM names (keywords) 
rather than numeric tags. Then there is a one-time check at the beginning -- if 
not already a tag number, the keyword is looked up and converted to the group 
element tag number. 

Definitely.

Also, it'd probably be sane to store the tag codes, after lookup, in a set 
instead of a list -- no worries about dupes, and O(1) lookups.

Thanks!

Original comment by njv...@gmail.com on 8 Dec 2010 at 12:11

thegooglecodearchive / pydicom

read_file(f, stop_before_pixels=True) is slower than without #94