Dear Dr. Guy Kloss,
This is a good question, as it touches on the parameter optimization of this method.
Let me walk through the procedure of the function in which the ValueError occurs, i.e., "_searching_results". In scheme.py, between Lines 388 and 404, we try to find, from the candidates "other_lists", a paired bit segment that satisfies the constraints after a single encoding pass with the given bit segment "fixed_list", within the candidate-search limit "self.search_count". However, there may be cases where no suitable paired bit segment is found within that limited number of searches. In such cases, as shown in Lines 405 to 435 of scheme.py, we fall back to generating a random bit segment to complete the search. For further details on the design and simulation, please refer to our paper.
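A rough sketch of this search-and-fallback idea (not the code in scheme.py; "satisfies_constraints" stands in for the actual screening checks):

```python
import random

# Sketch only: try up to `search_count` candidate partners for `fixed_list`;
# if none passes the screening constraints, fall back to a randomly
# generated bit segment, as described above.
def search_partner(fixed_list, other_lists, search_count, satisfies_constraints):
    for _ in range(search_count):
        candidate = random.choice(other_lists)
        if satisfies_constraints(fixed_list, candidate):
            return candidate, True
    # no suitable partner within the search limit: generate a random filler
    filler = [random.randint(0, 1) for _ in range(len(fixed_list))]
    return filler, False
```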
At Line 424 in scheme.py, the objective is to create an index region for the randomly generated bit segment. The requirement is that this generated segment must not affect the data recovery process (see Lines 132-181 in utils.index_operator.py). Specifically, its index should be at least two greater than the maximum index of all other bit segments, because the loss of two consecutive sequences is unlikely. Under this requirement, the range passed to "random.randint" can trigger a ValueError when the minimum value (total_count + 3) is greater than or equal to the maximum value (2^l, where l is the length of the index region).
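A minimal illustration of this failing condition, using hypothetical values for total_count and l:

```python
import random

# Hypothetical values: 30 real segments, but only a 5-bit index region,
# so the smallest allowed filler index (total_count + 3 = 33) already
# exceeds the largest representable index (2 ** 5 = 32).
total_count, l = 30, 5
try:
    random.randint(total_count + 3, 2 ** l)
except ValueError as error:
    print(error)  # empty-range error; the exact wording varies by Python version
```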
There are three possible solutions to this issue (or the rapid encoding issues):
If you have any further questions about this, please feel free to let me know.
Cheers, Haoling
Hi Haoling,
thanks for the quick response, and apologies for my late reply on this (I was off for a week). I had already started reading your paper 'Towards practical and robust DNA-based data archiving using the yin-yang codec system'. In parallel, I was trying to get a 'feel' for it by using your implementation. Also, as I was not interested in file-based data encoding, I had modified your implementation in a personal fork to support generic streams for data I/O, allowing for easier experimentation in memory:
https://github.com/pohutukawa/DNA-storage-YYC/tree/feature/generic-streams
I have just had a look at Chamaeleo, which seems much more capable and universal with different encoders. Very interesting. But unfortunately the GitHub code in the master branch does not seem to (fully) include the yin-yang codec, and the description provided in tutorial.rst cannot be reproduced (also the codec_factory is missing and thus not importable). I'd suppose there may be some things still missing in merging from other (private) branches.
I will continue reading your paper and see if I can make do with this code base. But I'd nonetheless be interested in Chamaeleo (if it were to get fixed soon). I could possibly (again) contribute generic stream handling to it as well.
Cheers,
Guy
I have found the specific piece in Chamaeleo that allows me to exercise the YinYangCodec. I have made a personal feature branch that implements the parameters input_stream and output_stream for pipeline.transcode() (together with an example file):
https://github.com/pohutukawa/Chamaeleo/tree/feature/generic-streams
Now I'll see how this works for the different payloads (as indicated above).
Just trying it with Chamaeleo, I can see that the issue seems to be the same. So I cannot create an encoding when the data does not meet the minimum requirement given by the segment_size. And if I have quite regularly structured data (e.g. multiples of the byte sequence b"foo bar baz "), the code quickly runs into issues and requires excessive time. Whereas if I use high-entropy data (e.g. random byte sequences), the transcoder works quite efficiently.
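A rough sketch of the two kinds of test payloads in question (the transcode call itself is left out here):

```python
import os

# Hypothetical test payloads of equal size: very regular, low-entropy data
# versus high-entropy random bytes.
size = 4096
regular = (b"foo bar baz " * (size // 12 + 1))[:size]
random_bytes = os.urandom(size)

for label, payload in (("regular", regular), ("random", random_bytes)):
    print(label, len(payload), "bytes,", len(set(payload)), "distinct byte values")
```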
I suppose these things just need to be known for usage (if the above assumptions are correct). So one may not want to (blindly) apply the transcoder to very structured files with low entropy. It may be worth documenting that (both in this repo as well as in Chamaeleo).
Though, I would be keen to find out if there are ways to mitigate this.
Also, further discussion should probably then be conducted on the Chamaeleo repo :-)
Hi Guy,
A quick solution here: the usual measure for this scenario is to compress the digital data in advance. DNA-fountain (Erlich et al., Science, 2017) adopted this measure. Compressing the digital data balances the data pattern at the byte-frequency level, so multiple repetitions of a particular binary message become far less likely.
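As a minimal sketch (zlib is used here purely as an example; any general-purpose compressor would do):

```python
import zlib

# Sketch only: compress a highly repetitive payload before encoding, so the
# byte-frequency distribution the codec sees is much closer to uniform.
payload = b"foo bar baz " * 1000
compressed = zlib.compress(payload, level=9)
print(len(payload), "->", len(compressed), "bytes")

# The compressed bytes would be fed to the encoder instead of the raw payload;
# after decoding from DNA, the original data is recovered with zlib.decompress().
assert zlib.decompress(compressed) == payload
```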
Achieving your objective may require refactoring the code rather than making small adjustments (possibly hard work), so the following are some comments for you.
As I understand it, your objective is to turn the YinYang Code from a file-based coding system into a stream-based coding system. From a computing standpoint, this is a commendable optimization objective, as it could pave the way for real-time information writing. Nevertheless, the slow pace of DNA synthesis remains a major obstacle to real-time processing, whereas the time needed to convert digital data into DNA strings is comparatively minor. This is why many well-established coding systems treat a file as a single data package during encoding.
There are generally two categories of methods: constrained coding schemes and screen-based coding schemes (e.g., DNA-fountain and the YinYang Code). Without parameter tuning, the latter can suffer from the excessive time consumption you mentioned above.
If you want to create a stream-based version of a screen-based coding scheme, you may need a queue to buffer the continuously incoming bit/byte stream. Here are some programming suggestions; see the sketch below.
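For instance, a rough buffering sketch along these lines (PACKAGE_SIZE and the encode callback are placeholders, not part of the YinYang Code or Chamaeleo APIs):

```python
from collections import deque

# Rough sketch only: accumulate the incoming byte stream in a queue and hand
# fixed-size packages to an encoder callback once enough data has arrived.
PACKAGE_SIZE = 2 ** 14  # bytes per package (hypothetical)

class StreamBuffer:
    def __init__(self, encode_package):
        self.chunks = deque()
        self.buffered = 0
        self.encode_package = encode_package  # e.g. a wrapper around the codec

    def feed(self, chunk: bytes):
        self.chunks.append(chunk)
        self.buffered += len(chunk)
        while self.buffered >= PACKAGE_SIZE:
            data = b"".join(self.chunks)
            package, rest = data[:PACKAGE_SIZE], data[PACKAGE_SIZE:]
            self.chunks = deque([rest]) if rest else deque()
            self.buffered = len(rest)
            self.encode_package(package)

# Usage: buffer = StreamBuffer(lambda pkg: print(len(pkg), "bytes ready"))
#        buffer.feed(b"... incoming bytes ...")
```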
If you have any further questions about this, please feel free to let me know.
Cheers, Haoling
By the way, if you are interested in collaborating with us to promote the adaptation of YinYang Code on the stream-based coding scheme, please feel free to contact Dr. Zhi Ping and cc me (pingzhi[at]genomics.cn, zhanghaoling[at]genomics.cn).
Haoling
Kia ora Haoling. I've sent an email.
I have been throwing a lot of (very short) data at the encoder to see how it behaves, after seeing issues with it in our project. We only have rather short data sequences that need to be encoded, so it was quite easy to run it in a loop and observe its behaviour.
In many cases ValueErrors were raised with an empty-range exception from randrange(), e.g. empty range for randrange() (10, 8, -2). The error originated in this code bit:
Far more seldom, but still in many cases, the code was getting trapped within an infinite loop (or at least one that spun around and did not finish for minutes). This loop has always appeared within YYC._list_to_sequence during a deepcopy() operation.
Any insights on what may be going wrong here? Or how it could be fixed?